The challenges of streaming real-time data

Jud Valeski on how Gnip handles the Twitter fire hose.

Although Gnip handles real-time streaming of data from a variety of social media sites, it’s best known as the official commercial provider of the Twitter activity stream.

Frankly, “stream” is a misnomer. “Fire hose,” the colloquial variation, better represents the torrent of data Twitter produces. That hose pumps out around 155 million tweets per day, and it all has to be delivered at a sustained rate.

I recently spoke with Gnip CEO Jud Valeski (@jvaleski) about what it takes to manage Twitter’s flood of data and how the Internet’s architecture needs to adapt to real-time needs. Our interview follows.


The Internet wasn’t really built to handle a river of big data. What are the architectural challenges of running real-time data through these pipes?

Jud Valeski: The most significant challenge is rusty infrastructure. Just as with many massive infrastructure projects that the world has seen, adopted, and exploited (aqueducts, highways, power/energy grids), the connective tissue of the network becomes excruciatingly dated. We’re lucky to have gotten as far as we have on it. The capital build-outs on behalf of the telecommunications industry have yielded relatively low-bandwidth solutions laden with false advertising about true throughput. The upside is that highly transactional HTTP REST apps are relatively scalable in this environment and they “just work.” It isn’t until we get into heavy payload apps — video streaming, large-scale activity fire hoses like Twitter — that the deficiencies in today’s network get put in the spotlight. That’s when the pipes begin to burst.

We can redesign applications to create smaller activities/actions in order to reduce overall sizes. We can use tighter protocols/formats (Protocol Buffers for example), and compression to minimize sizes as well. However, with the ever-increasing usage of social networks generating more “activities,” we’re running into true pipe capacity limits, and those limits often come with very hard stops. Typical business-class network connections don’t come close to handling high volumes, and you can forget about consumer-class connections handling them.
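
As a rough illustration of those two levers — tighter formats and compression — here is a minimal Python sketch. The activity payload is made up and heavily simplified; it shows gzip cutting the bytes on the wire for a batch of newline-delimited JSON activities, and a schema-based binary format such as Protocol Buffers would typically shrink things further by dropping the repeated field names.

```python
import gzip
import json

# A hypothetical, simplified activity; real payloads carry far more metadata,
# which is exactly why size starts to matter at fire-hose volumes.
activity = {
    "id": "tag:search.twitter.com,2005:123456789",
    "verb": "post",
    "actor": {"displayName": "example_user", "followersCount": 42},
    "body": "Just had some really good food downtown.",
    "postedTime": "2011-07-20T16:00:00Z",
}

# A batch of 1,000 newline-delimited activities, roughly how a stream arrives.
batch = b"\n".join(
    json.dumps(dict(activity, id=f"tag:search.twitter.com,2005:{i}")).encode("utf-8")
    for i in range(1000)
)
compressed = gzip.compress(batch)

print(f"raw JSON batch: {len(batch):,} bytes")
print(f"gzip'd batch:   {len(compressed):,} bytes")
```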

Beyond infrastructure issues, as engineers, the web app programming we’ve been doing over the past 15 years has taught us to build applications in a highly synchronous transactional manner. Because each HTTP transaction generally only lasts a second or so at most, it’s easy to digest and process many discrete chunks of data. However, the bastard stepchild of every HTTP lib’s “get()” routine that returns the complete result is the “read()” routine that only gives you a poorly bounded chunk.
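
To make that concrete, here is a minimal sketch of the reassembly work a streaming consumer inherits. It assumes a newline-delimited stream — a common convention, not necessarily Twitter’s exact framing — and feeds it deliberately awkward chunk boundaries of the kind read() produces.

```python
def activities(chunks, delimiter=b"\n"):
    """Reassemble complete activities from poorly bounded read() chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while delimiter in buffer:
            activity, buffer = buffer.split(delimiter, 1)
            if activity:                      # skip blank keep-alive lines
                yield activity

# Chunk boundaries fall mid-activity, just as they do coming off the wire.
raw_chunks = [
    b'{"body": "tweet one"}\n{"body": "tw',
    b'eet two"}\n{"bo',
    b'dy": "tweet three"}\n',
]

for activity in activities(raw_chunks):
    print(activity.decode("utf-8"))
```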

You would be shocked at the ratio of engineers who can’t build event-driven, asynchronous data processing applications, to those who can, yet this is a big part of this space. Lack of ecosystem knowledge around these kinds of programming primitives is a big problem. Many higher level abstractions exist for streaming HTTP apps, but they’re not industrial strength, and therefore you have to really know what’s going on to build your own.
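
For readers who haven’t written in this style, here is a toy sketch of the event-driven, asynchronous shape Valeski is describing, using Python’s asyncio (an assumption of this sketch, not Gnip’s stack): one coroutine drains a simulated wire while another processes activities, with a bounded queue providing back-pressure so a slow consumer never blocks the read loop.

```python
import asyncio

async def read_stream(queue):
    # Stand-in for bytes coming off the wire at network pace.
    for i in range(5):
        await queue.put(f'{{"body": "tweet {i}"}}')
        await asyncio.sleep(0.01)
    await queue.put(None)                     # end-of-stream sentinel

async def process(queue):
    # Consumer runs independently of the reader, draining as items arrive.
    while (activity := await queue.get()) is not None:
        print("processed", activity)

async def main():
    queue = asyncio.Queue(maxsize=1000)       # bounded: provides back-pressure
    await asyncio.gather(read_stream(queue), process(queue))

asyncio.run(main())
```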

Shifting back to infrastructure: Often the bigger issue plaguing the network itself is one of latency, not throughput. While data tends to move quickly once streaming connections are established, inevitable reconnects create gaps. The longer those connections take to stand up, the bigger the gaps. Run a traceroute to your favorite API and see how many hops you take. It’s not pretty. Latencies on the network are generally a function of router and gateway clutter, as our packets bounce across a dozen servers just to get to the main server and then back to the client.
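
The arithmetic behind that concern is simple enough to sketch: at the roughly 155 million tweets per day mentioned above, every second a reconnect spends standing up a new connection is a second of activities that must be recovered or dropped.

```python
# Back-of-the-envelope: what a reconnect gap costs at sustained fire-hose rates.
TWEETS_PER_DAY = 155_000_000
per_second = TWEETS_PER_DAY / 86_400          # roughly 1,800 activities per second

for gap_seconds in (1, 5, 30):
    missed = per_second * gap_seconds
    print(f"{gap_seconds:>2}s reconnect gap ~ {missed:,.0f} missed activities")
```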

How is Gnip addressing these issues?

Jud Valeski: On the infrastructure side, we are trying (successfully to date) to use existing, relatively off-the-shelf backplane network topologies in the cloud to build our systems. We live on EC2 Larges and XLs to ensure dedicated NICs in our clusters. That helps with the router and gateway clutter. We’re also working with Amazon to ensure seamless connection upgrades as volumes increase. These are use cases they actually want to solve at a platform level, so our incentives are nicely aligned. We also play at the IP-stack level to ensure packet transmission is optimized for constant high-volume streams.
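
Valeski doesn’t enumerate which knobs Gnip actually turns, so the following is a hedged sketch of the kind of IP-stack tuning a constant high-volume stream invites: a larger receive buffer to absorb bursts, keepalives to detect silently dead connections, and Nagle’s algorithm disabled so small writes aren’t delayed.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Larger receive buffer to ride out bursts without dropping packets.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
# TCP keepalives so a silently dead connection is detected and re-established.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Disable Nagle's algorithm so small writes aren't held back and batched.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

print("receive buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```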

Once total volumes move past standard inbound and outbound connection capabilities, we will be offering dedicated interconnects. However, those come at a very steep price for us and our volume customers.

All of this leads me to my real answer: Trimming the fat.

While a sweet spot for us is certainly high-volume data consumers, there are many folks who don’t want volume; they want coverage: coverage of just the activities they care about, usually their customers’ brands or products. We take on the challenge of digesting and processing the high volume on inbound, and we distill the stream down to just the bits our coverage customers desire. You may need 100% of the activities that mention “good food,” but that obviously isn’t 100% of a publisher’s fire hose. Processing high-velocity root streams on behalf of hundreds of customers without adversely impacting latency takes a lot of work. Today, that means good ol’ fashioned engineering.
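
Here is a minimal sketch of that distillation step, with made-up customer names and a deliberately naive rule language (simple substring phrases rather than whatever Gnip’s real rule syntax is): one pass over the inbound stream, fanning each activity out only to the customers whose rules it matches.

```python
# Illustrative customer rules: each customer cares about a handful of phrases.
CUSTOMER_RULES = {
    "restaurant-brand": ["good food", "great service"],
    "coffee-brand": ["espresso", "latte art"],
}

def matching_customers(activity_text):
    """Return the customers whose rules match this activity."""
    text = activity_text.lower()
    return [
        customer
        for customer, phrases in CUSTOMER_RULES.items()
        if any(phrase in text for phrase in phrases)
    ]

# A few activities standing in for the inbound fire hose.
stream = [
    "Just had some really good food downtown.",
    "This latte art is ridiculous.",
    "Completely unrelated tweet about the weather.",
]

for activity in stream:
    for customer in matching_customers(activity):
        print(f"deliver to {customer}: {activity}")
```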

What tools and infrastructure changes are needed to better handle big-data streaming?

Jud Valeski: “Big data” as we talk about it today has been slayed by lots of cool abstractions (e.g. Hadoop) that fit nicely into the way we think about the stack we all know and love. “Big streams,” on the other hand, challenge the parallelization primitives folks have been solving for “big data.” There’s very little overlap, unfortunately.

So, on the software solution side, better and more widely used frameworks are needed. Companies like BackType and Gnip pushing their current solutions onto the network for open refinement would be an awesome step forward. I’m intrigued by the prospect of BackType’s Storm project, and I’m looking forward to seeing more of it. More brains lead to better solutions.

We shouldn’t be giving CPU and network latency injection a second thought, but we have to. The code I write to process bits as they come off the wire — quickly — should just “go fast,” regardless of its complexity. That’s too hard today. It requires too much custom code.

On the infrastructure side of things, ISPs need to provide cheaper access to reliable fat pipes. If they don’t, software will outpace their lack of innovation. To be clear, they don’t get this and the software will lap them. You asked what I think we need, not what I think we’ll actually get.

This interview was edited and condensed.
