Improving on the Lambda Architecture for streaming analysis

Modern organizations have started pushing their big data initiatives beyond historical analysis. Fast data creates big data, and applications are being developed that capture value, chiefly through real-time analytics, the moment fast data arrives. The need to analyze streaming data in real time, whether for analytics, alerting, customer engagement, or other on-the-spot decision-making, is driving convergence on a layered software setup called the Lambda Architecture.

The Lambda Architecture, a collection of both big and fast data software components, is a software paradigm designed to capture value, specifically analytics, from not only historical data, but also from data that is streaming into the system.

In this article, I’ll explain the challenges this architecture currently presents and explore some of its weaknesses. I’ll also discuss an alternative architecture using an in-memory database that can simplify and extend the capabilities of Lambda. Some of the enhancements to the Lambda Architecture that will be discussed are:

The ability to return real-time, low-latency responses back to the originating system for immediate actions, such as customer-tailored responses. Data doesn’t have to only flow one way, into the system.
The addition of a transactional, consistent (ACID) data store. Data entering the system can be transactional and operated upon in a consistent manner.
The addition of SQL support to the speed layer, providing support for ad hoc analytics as well as support for standard SQL report tooling.

What is Lambda?

The Lambda Architecture is an emerging big data architecture designed to ingest, process, and compute analytics on both fresh (real-time) and historical (batch) data together. In his book Big Data: Principles and Best Practices of Scalable Realtime Data Systems, Nathan Marz introduces the Lambda Architecture and states that:

“The Lambda Architecture…provides a general-purpose approach to implementing an arbitrary function on an arbitrary data set and having the function return its results with low latency.”

Marz further defines three key layers in the Lambda Architecture:

Batch layer
This is the historical archive used to hold all of the data ever collected. This is usually a “data lake” system, such as Hadoop, though it could also be an online analytical processing (OLAP) data warehouse like Vertica or Netezza. The batch layer supports batch queries, which compute historical predefined and ad hoc analytics.
Speed layer
The speed layer supports computing real-time analytics on fast-moving data as it enters the system. This layer is a combination of queuing, streaming, and operational data stores.

Like the batch layer, the speed layer computes analytics, except that it runs its computations in real time, on fresh data, as the data enters the system. Its purpose is to compute these analytics quickly, at low latency. The batch layer’s analytics, by contrast, are performed over a larger, slightly older data set and take significantly longer to compute. If you relied solely on the batch layer to ingest and compute analytics, its results would likely be minutes to an hour old by the time they became available. It is the speed layer’s responsibility to calculate real-time analytics on the fast-moving data the batch layer has not yet processed, such as data that is zero to one hour old. Thus, when you query the system, you can get a complete view of analytics across the most recent data and all historical data.
Serving layer
This layer caches results from batch-layer computations so they are immediately available to answer queries. Computing batch layer queries can take time. Periodically, these analytics are re-computed and the cached results are refreshed in the serving layer.

To summarize, Lambda defines a big data architecture that allows pre-defined and arbitrary queries and computations on both fast-moving data and historical data.
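The way the three layers combine at query time can be illustrated with a minimal sketch. It is a toy, single-process model rather than a real deployment: the class name LambdaViews, its methods, and the page-hit metric are all assumptions made for illustration, and in practice the master log would live in a system such as Hadoop while the speed view would be maintained by a streaming system.

```python
from collections import Counter


class LambdaViews:
    """Toy model of the three layers (all names here are hypothetical)."""

    def __init__(self):
        self.master_log = []         # batch layer: every event ever collected
        self.batch_view = Counter()  # serving layer: cached result of the last batch run
        self.speed_view = Counter()  # speed layer: counts for events since the last batch run

    def ingest(self, page):
        # Each event is appended to the immutable log and counted
        # incrementally by the speed layer.
        self.master_log.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # Periodic recomputation over the full history. Once the batch view
        # covers these events, the speed layer's counts for them are dropped.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, page):
        # A complete answer merges the cached batch result with the
        # speed layer's view of the most recent data.
        return self.batch_view[page] + self.speed_view[page]


views = LambdaViews()
for page in ["/home", "/pricing", "/home"]:
    views.ingest(page)
views.run_batch()
views.ingest("/home")
print(views.query("/home"))
```

In this sketch the batch run is instantaneous, so the hand-off between the two views is trivial. In a real system the batch job takes time, and the speed layer has to keep serving the window the batch job has not yet covered.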

Common Lambda applications

New applications for the Lambda Architecture are emerging seemingly weekly. Some of the more common use cases of Lambda-based applications revolve around log ingestion and analytics on those log messages. “Logs” in this context could be general server log messages, website clickstream logging, VPN access logs, or the popular practice of collecting analytics on Twitter streams.

The architecture improves on present-day architectures by being able to capture analytics on fast-moving data as it enters the system. This data, which is immutable, is ingested by both Lambda’s speed layer and batch layer, usually in parallel, by way of message queues and streaming systems, such as Kafka and Storm. The ingestion of each log message does not require a response to the entity that delivered the data — it is a one-way data pipeline.

A log message’s final resting place is the data lake, where batch metrics are (re)computed. The speed layer computes similar results for the most recent “window,” staying ahead of the Hadoop/batch layer. Thus, the Lambda Architecture allows applications to take recent data into account but supports the same basic applications as batch analytics — not real-time decision-making, such as determining which ad or promotion to serve to a visitor.

Analytics at the speed and batch layer can be predefined or ad hoc. Should new analytics be desired in the Lambda Architecture, the application could rerun the entire data set, from the data lake or from the original log files, to recompute the new metrics. For example, analytics for website click logs could count page hits and page popularity. For Tweet streams, it could compute trending topics.
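Because the raw events are kept unchanged, adding a metric amounts to running a new function over the same log. The sketch below assumes a hypothetical click log of (timestamp, user, page) tuples. The sample data and the function names page_hits and unique_visitors are illustrative only and do not come from any particular product.

```python
from collections import defaultdict

# Hypothetical immutable click log: (timestamp, user, page)
CLICK_LOG = [
    ("2015-01-01T10:00:00", "alice", "/home"),
    ("2015-01-01T10:00:05", "bob", "/home"),
    ("2015-01-01T10:00:09", "alice", "/pricing"),
    ("2015-01-01T10:00:12", "alice", "/home"),
]


def page_hits(log):
    """The original, predefined metric: total hits per page."""
    hits = defaultdict(int)
    for _ts, _user, page in log:
        hits[page] += 1
    return dict(hits)


def unique_visitors(log):
    """A metric added later: distinct users per page.

    No stored data needs to change; the new metric is computed by
    rerunning over the same raw events.
    """
    visitors = defaultdict(set)
    for _ts, user, page in log:
        visitors[page].add(user)
    return {page: len(users) for page, users in visitors.items()}


print(page_hits(CLICK_LOG))
print(unique_visitors(CLICK_LOG))
```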

Limitations of the Lambda Architecture

Although it represents an advance in data analysis and exploits many modern tools well, Lambda falls short in a few ways:

One-way data flow
In Lambda, immutable data flows in one direction: into the system. The architecture’s main goal is to execute OLAP-type processing faster — in essence, reducing the time required to consult column-stored data from a couple of seconds to about 100ms.

Therefore, the Lambda Architecture doesn’t achieve some of the potentially valuable applications of real-time analytics, such as user segmentation and scoring, fraud detection, detecting denial of service attacks, and calculating consumer policies and billing. Lambda doesn’t transact and make per-event decisions on the streaming data, nor does it respond immediately to the events coming in.
Eventual consistency
Although adequate for popular consumer applications such as displaying status messages, eventual consistency, the trade-off many distributed data stores adopt in response to the well-known CAP dilemma, is less robust than the transactions offered by relational databases and some NoSQL products. More important, reliance on eventual consistency makes it impossible to feed data quickly back into the batch layer and alter analytics on the fly.
NoSQL
Most tools in the Lambda stack require custom coding to their unique APIs instead of allowing well-understood SQL queries or the use of common tools, such as business intelligence software.
Complexity
The Lambda Architecture is currently composed of many disparate components passing messages from one to the next. This complexity gets in the way of making instant decisions on real-time data. An oft-cited blog posting from last year explains this weakness.

One company attempted to solve a streaming data problem by implementing the Lambda Architecture as follows:

The speed layer enlisted Kafka for ingestion, Storm for processing, Cassandra for state, and Zookeeper for distributed coordination.
The batch layer loaded tuples in batches into S3, then processed the data with Cascading and Amazon Elastic MapReduce.
The serving layer employed a key/value store such as ElephantDB.

Each component required at least three nodes; the speed layer alone needed 12 nodes. For a situation requiring high speed and accuracy, the company implemented a fragile, complex infrastructure.

In-memory databases can be designed to fill the gaps left by the Lambda Architecture. I’ll finish this article by looking at a solution involving in-memory databases, using VoltDB as a model.

Simplifying the Lambda Architecture

The Lambda Architecture can be simplified, preserving its key virtues while enabling missing functionality by replacing the complex speed layer and part of the batch layer with a suitable distributed in-memory database.

VoltDB, for instance, is a clustered, in-memory, relational database that supports the fast ingest of data, real-time ad hoc analytics, and the rapid export of data to downstream systems such as Hadoop and OLAP offerings. A fast relational database fits squarely and solidly in the Lambda Architecture’s speed layer — provided the database is fast enough. Like popular streaming systems, VoltDB is horizontally scalable, highly available, and fault tolerant, all while sustaining transactional ingestion speeds of hundreds of thousands to millions of events per second. In the standard Lambda Architecture, the inclusion of this single component greatly simplifies the speed layer by replacing both its streaming and operational data store portions.

In this revised architecture, a queuing system such as Kafka can feed both VoltDB and Hadoop in parallel. Alternatively, events can be sent to the database directly, and the database then immediately exports each event to the data lake.
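The second path, in which the database ingests each event and then exports it, can be sketched with a stand-in. The example below uses Python's built-in sqlite3 module with an in-memory database purely as a placeholder for a relational speed layer. It is not VoltDB, it is not distributed, and the table name, function names, and the list standing in for the data lake are all assumptions.

```python
import sqlite3


def make_speed_store():
    """Create an in-memory relational store (a stand-in, not VoltDB)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE clicks (user_id TEXT, page TEXT, ts INTEGER)")
    return db


def ingest(db, data_lake, event):
    """Insert one event transactionally, then export it downstream."""
    with db:  # commits on success, rolls back if the insert raises
        db.execute("INSERT INTO clicks VALUES (?, ?, ?)", event)
    # Reached only after a successful commit; a real system would export
    # to Hadoop or an OLAP store rather than append to a Python list.
    data_lake.append(event)


def top_pages(db, limit=3):
    """Ad hoc SQL over the freshest data."""
    return db.execute(
        "SELECT page, COUNT(*) AS hits FROM clicks "
        "GROUP BY page ORDER BY hits DESC, page LIMIT ?",
        (limit,),
    ).fetchall()


db = make_speed_store()
lake = []
for event in [("alice", "/home", 1), ("bob", "/home", 2), ("alice", "/pricing", 3)]:
    ingest(db, lake, event)
print(top_pages(db))
```

The point of the sketch is the shape of the flow: one component accepts the write, answers SQL queries over it, and hands the event on, where the standard speed layer would use separate queuing, streaming, and storage products.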

Applications that make use of in-memory capabilities

As defined today, the Lambda Architecture is very focused on fast data collection and read-only queries on both fast and historical data. In Lambda, data is immutable. External systems make use of the Lambda-based environment to query the computed analytics. These analytics are then used for alerts (should metric thresholds be crossed), or harvested, for example in the case of Twitter trending topics.

When considering improvements to the Lambda Architecture, what if you could react, per event, to the incoming data stream? In essence, you’d have the ability to take action based on the incoming feed, in addition to performing analytics.

Many developers are building streaming, fast data applications using the clustered, in-memory, relational database approach suggested by VoltDB. These systems ingest events from sources such as log files, the Internet of Things (IoT), user clickstreams, online game play, and financial applications. While some of these applications passively ingest events and provide real-time analytics and alerting on the data streams (in typical Lambda style), many have begun interacting with the stream, adding per-event decision-making and transactions in addition to real-time analytics.

Additionally, in these systems, the speed layer’s analytics can differ from the batch layer’s analytics. Often, the data lake is used to mine intelligence via exploratory queries. This intelligence, when identified, is then fed to the speed layer as input to the per-event decisions. In this revised architecture:

Data arrives at a high rate and is ingested. It is immediately exported to the batch layer.
Historical intelligence can be mined from the batch layer and the aggregate “intelligence” can be delivered to the speed layer for per-event real-time decision-making (for instance, to determine which ad to display for a segmented/categorized web browser/user).
Fast data is either passively ingested, or a response can be computed by the new decision-making layer, using both real-time data and historical “mined” intelligence.
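The three steps above can be sketched as a small decision layer that combines intelligence mined in batch with state updated on every event. This is a minimal illustration under stated assumptions: the class DecisionLayer, the segment and ad names, and the impression cap are hypothetical, and a production system would keep this state in a transactional database rather than in Python dictionaries.

```python
class DecisionLayer:
    """Per-event decisions using mined segments plus real-time state."""

    def __init__(self, segment_ads, default_ad="generic_ad"):
        self.segment_ads = dict(segment_ads)  # segment -> ad to show
        self.default_ad = default_ad
        self.user_segments = {}               # user -> segment, mined in the batch layer
        self.impressions = {}                 # (user, ad) -> times shown, updated per event

    def load_intelligence(self, user_segments):
        # Periodically refreshed with results mined from the data lake.
        self.user_segments = dict(user_segments)

    def on_event(self, user_id, max_impressions=2):
        # Historical intelligence picks the candidate ad.
        segment = self.user_segments.get(user_id)
        ad = self.segment_ads.get(segment, self.default_ad)
        # Real-time state can override it: cap how often one ad is shown.
        if self.impressions.get((user_id, ad), 0) >= max_impressions:
            ad = self.default_ad
        self.impressions[(user_id, ad)] = self.impressions.get((user_id, ad), 0) + 1
        return ad  # the response sent back to the originating system


layer = DecisionLayer({"sports": "sneakers_ad"})
layer.load_intelligence({"alice": "sports"})
print([layer.on_event("alice") for _ in range(3)])
print(layer.on_event("bob"))
```

Unlike the one-way pipeline described earlier, every call to on_event returns a response to the caller, which is the behavior the standard Lambda Architecture does not provide.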

A blog posting from VoltDB offers an overview and example of this fully interactive Lambda-like approach. Another VoltDB resource offers code for a working speed layer.

Conclusion

The Lambda Architecture is a powerful big data analytics framework that serves queries from both fast and historical data. However, the architecture emerged from a need to execute OLAP-type processing faster, without considering a new class of applications that require real-time, per-event decision-making. In its current form, Lambda is limited: immutable data flows in one direction, into the system, for analytics harvesting.

Using a fast in-memory scalable relational database in the Lambda Architecture greatly simplifies the speed layer by reducing the number of components needed.

Lambda’s shortcoming is the inability to build responsive, event-oriented applications. In addition to simplifying the architecture, an in-memory, scale-out relational database lets organizations execute transactions and per-event decisions on fast data as it arrives. In contrast to the one-way streaming system feeding events into the speed layer, using a fast database as an ingestion engine provides developers with the ability to place applications in front of the event stream. This lets applications capture value the moment the event arrives, rather than capturing value on an aggregate basis at some point after the event arrives.

This approach improves the Lambda Architecture by:

Reducing the number of moving pieces — the products and components. Specifically, major components of the speed layer can be replaced by a single component. Further, the database can be used as a data store for the serving layer.
Letting an application make per-event decisions and execute transactions.
Providing the traditional relational database interaction model, with both ad hoc SQL capabilities and Java on fast data. Applications can use familiar standard SQL, which lets them change queries quickly without requiring complex programming logic. Applications can also use standard analytics tooling, such as Tableau, MicroStrategy, and Actuate BIRT, on top of fast data.

This post is part of a collaboration between O’Reilly and VoltDB. See our statement of editorial independence.