"stream processing" entries
Two approaches to testing lambdafied code.
Over the past 18 months or so I’ve been talking to a lot of people about lambda expressions in Java 8. This isn’t that unusual when you’ve written a book on Java 8 and also run a training course on the topic! One of the questions I often get asked by people is how do lambda expressions alter how they test code? It’s an increasingly pertinent question in a world where more and more people have some kind of automated unit or regression test suite that runs over their project and when many people do Test Driven Development. Let’s explore some of the problems you may encounter when testing code that uses lambdas and streams and how to solve them.
Usually, when writing a unit test you call a method in your test code that gets called in your application. Given some inputs and possibly test doubles, you call these methods to test a certain behavior happening and then specify the changes you expect to result from this behavior.
Lambda expressions pose a slightly different challenge when unit testing code. Because they don’t have a name, it’s impossible to directly call them in your test code. You could choose to copy the body of the lambda expression into your test and then test that copy, but this approach has the unfortunate side effect of not actually testing the behavior of your implementation. If you change the implementation code, your test will still pass even though the implementation is performing a different task.
What do you get if you cross a distributed database with a stream processing system?
One of the concepts that has proven the hardest to explain to people when I talk about Samza is the idea of fault-tolerant local state for stream processing. I think people are so used to the idea of keeping all their data in remote databases that any departure from that seems unusual.
So, I wanted to give a little bit more motivation as to why we think local state is a fundamental primitive in stream processing.
What is state and why do you need it?
An easy way to understand state in stream processing is to think about the kinds of operations you might do in SQL. Imagine running SQL queries against a real-time stream of data. If your SQL query contains only filtering and single-row transformations (a simple
where clause, say), then it is stateless. That is, you can process a single row at a time without needing to remember anything in between rows. However, if your query involves aggregating many rows (a
group by) or joining together data from multiple streams, then it must maintain some state in between rows. If you are grouping data by some field and counting, then the state you maintain would be the counts that have accumulated so far in the window you are processing. If you are joining two streams, the state would be the rows in each stream waiting to find a match in the other stream.
The Lambda Architecture has its merits, but alternatives are worth exploring.
Nathan Marz wrote a popular blog post describing an idea he called the Lambda Architecture (“How to beat the CAP theorem“). The Lambda Architecture is an approach to building stream processing applications on top of MapReduce and Storm or similar systems. This has proven to be a surprisingly popular idea, with a dedicated website and an upcoming book. Since I’ve been involved in building out the real-time data processing infrastructure at LinkedIn using Kafka and Samza, I often get asked about the Lambda Architecture. I thought I would describe my thoughts and experiences.
What is a Lambda Architecture and how do I become one?
The Lambda Architecture looks something like this: