Building Apache Kafka from scratch

At the heart of big data platforms are robust data flows that connect diverse data sources. Over the past few years, a new set of (mostly open source) software components has become critical to tackling data integration problems at scale. By now, many people have heard of tools like Hadoop, Spark, and NoSQL databases, but there are a number of lesser-known components that are “hidden” beneath the surface.

In my conversations with data engineers tasked with building data platforms, one tool stands out: Apache Kafka, a distributed messaging system that originated from LinkedIn. It’s used to synchronize data between systems and has emerged as an important component in real-time analytics.

In my travels over the past year, I’ve met engineers across many industries who use Apache Kafka in production. A few months ago, I sat down with O’Reilly author and Radar contributor Jay Kreps, a highly regarded data engineer and former technical lead for Online Data Infrastructure at LinkedIn, and most recently CEO/co-founder of Confluent. One of the things we talked about was how Apache Kafka came about:

“I was running the team that was working on Hadoop stuff at LinkedIn. Our assumption had been that we’d get Hadoop and we would work on these really cool algorithms, and we’d just have this little bullet item of getting the data in and getting all the data out for serving. Then that little bullet item ended up being so much of a pain, and even when we had data, the structure of that data made managing it such a problem.

“I lobbied for a long time internally to do something principled about this. Instead of throwing people at the problem, trying to do something principled around data. Internally, we had a bunch of ideas about this. At that point, we picked up a little bit more infrastructure knowledge, so we had ideas about logs and distributed logs, and logs and databases and how they’re used. We were really interested in coming up with some way we could make real-time subscribable data feeds and a way to solve some of our other problems with that.

“We eventually did do that, and totally, explicitly, had the problem we wanted to solve in mind. Then, of course, like most things, you think, ‘oh, it’ll take a few months,’ and then three years later, you’re still working on it.”

Kreps and his team began by looking at existing tools and strategies for managing data flows, but none addressed their needs completely. Over time, they began to consider building a system themselves, and thus Apache Kafka was born:

“I guess I would call it data integration problems or getting all the data flowing reliably. I was running that system, and it was increasingly apparent to me how much this problem existed across the organization and how we were at the mercy of these data flows. If that’s broken, then everything downstream is useless. So yeah, we went through all these pains trying to do different things. At a certain point, if you have an infrastructure background, you think about it and it’s like this problem isn’t that hard, we should be able to do this. That’s how you kind of trick yourself into working on some from-scratch solution.”

Just as many of the popular components of the big data ecosystem mature, new data sources and applications are starting to appear. The Internet of Things (IoT) will require tools for managing, storing, and analyzing event data. The data stack for the IoT will be built by data engineers who have come to depend on components like Apache Kafka:

“I don’t think we anticipated how much people would use it. I think part of that is just good timing if that data ended up being really essential. I think we understood that, but we underestimated the importance of this and the generality of it as a way to think about all your different types of data. I think that’s becoming more true.

“A lot of the things I’ve been talking about with people involve getting these devices hooked up to the Internet, and that’s going to be generating event data. I think the importance of that, and the importance not just in the Internet industry but elsewhere, was something we really didn’t understand.”

Listen to the full podcast through SoundCloud or iTunes. Jay Kreps will be speaking at Strata+Hadoop World in San Jose this coming February.

This post is part of our ongoing exploration into Big Data Components.
