FEATURED STORY

The world beyond batch: Streaming 101

A high-level tour of modern data-processing concepts.

Editor’s note: This is the first post in a two-part series about the evolution of data processing, with a focus on streaming systems, unbounded data sets, and the future of big data.

Streaming data processing is a big deal in big data these days, and for good reasons. Amongst them:

  • Businesses crave ever more timely data, and switching to streaming is a good way to achieve lower latency.
  • The massive, unbounded data sets that are increasingly common in modern business are more easily tamed using a system designed for such never-ending volumes of data.
  • Processing data as they arrive spreads workloads out more evenly over time, yielding more consistent and predictable consumption of resources.

Despite this business-driven surge of interest in streaming, the majority of streaming systems in existence remain relatively immature compared to their batch brethren, which has resulted in a lot of exciting, active development in the space recently.

As someone who’s worked on massive-scale streaming systems at Google for the last five+ years (MillWheel, Cloud Dataflow), I’m delighted by this streaming zeitgeist, to say the least. I’m also interested in making sure that folks understand everything that streaming systems are capable of and how they are best put to use, particularly given the semantic gap that remains between most existing batch and streaming systems. To that end, the fine folks at O’Reilly have invited me to contribute a written rendition of my Say Goodbye to Batch talk from Strata + Hadoop World London 2015. Since I have quite a bit to cover, I’ll be splitting this across two separate posts:

  1. Streaming: This first post will cover some basic background information and clarify some terminology before diving into details about time domains and a high-level overview of common approaches to data processing, both batch and streaming (a brief illustrative sketch of the time-domain idea follows this list).
  2. The Dataflow Model: The second post will consist primarily of a whirlwind tour of the unified batch + streaming model used by Cloud Dataflow, facilitated by a concrete example applied across a diverse set of use cases. After that, I’ll conclude with a brief semantic comparison of existing batch and streaming systems.
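To make the time-domain distinction a little more concrete before moving on, here is a small Python sketch. It is not taken from the talk or from either post; the event tuples, window size, and function name are made up for illustration. The idea it shows is fixed windowing by event time, i.e., grouping records by when they occurred rather than by when they happened to arrive:

    from collections import defaultdict

    # Hypothetical (event_time_seconds, value) pairs, listed in arrival order.
    # The event that occurred at t=9 arrives last; out-of-order arrival is
    # exactly what makes unbounded data sets tricky to process.
    events = [(12, 5), (7, 3), (58, 1), (61, 4), (9, 2)]

    def sum_by_event_time_window(events, window_size=60):
        """Sum values into fixed windows keyed by when each event occurred."""
        windows = defaultdict(int)
        for event_time, value in events:
            window_start = (event_time // window_size) * window_size
            windows[window_start] += value
        return dict(windows)

    print(sum_by_event_time_window(events))
    # {0: 11, 60: 4} -- the late-arriving (9, 2) still lands in its correct window

Bucketing the same stream by processing (arrival) time instead would scatter that late record into whatever window happened to be open when it showed up, which is one of the gaps between time domains the posts go on to explore.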

So, long-winded introductions out of the way, let’s get nerdy. Read more…

Four short links: 5 August 2015

Facebook Video, Lost Links, Regulatory Push, and LLVM Teases

  1. Theft, Lies, and Facebook Video (Medium) — inexcusable that Facebook, a company with a market cap of $260 BILLION, launched their video platform with no system to protect independent rights holders. It wouldn’t be surprising if Facebook was working on a solution now, which they can roll out conveniently after having made their initial claim to being the biggest, most important thing in video. In the words of Gillian Welch, “I wanna do right, but not right now.”
  2. The Web We Have to Save — Nearly every social network now treats a link just the same as it treats any other object — the same as a photo, or a piece of text — instead of seeing it as a way to make that text richer. You’re encouraged to post one single hyperlink and expose it to a quasi-democratic process of liking and plussing and hearting: Adding several links to a piece of text is usually not allowed. Hyperlinks are objectivized, isolated, stripped of their powers.
  3. California Regulator Pushing for All Cars to be Electric (Bloomberg) — Nichols really does intend to force automakers to eventually sell nothing but electrics. In an interview in June at her agency’s heavy-duty-truck laboratory in downtown Los Angeles, it becomes clear that Nichols, at age 70, is pushing regulations today that could by midcentury all but banish the internal combustion engine from California’s famous highways. “If we’re going to get our transportation system off petroleum,” she says, “we’ve got to get people used to a zero-emissions world, not just a little-bit-better version of the world they have now.” How long until the same article is written, but about driverless cars?
  4. LLVM for Grad Students — fast intro to why LLVM is interesting. LLVM is a great compiler, but who cares if you don’t do compilers research? A compiler infrastructure is useful whenever you need to do stuff with programs.

What it means to “go pro” in data science

A look at what it takes to be a professional data science programmer.

My experience of being a data scientist is not at all like what I’ve read in books and blogs. I’ve read about data scientists working for digital superstar companies. They sound like heroes writing automated (near-sentient) algorithms constantly churning out insights. I’ve read about MacGyver-like data scientist hackers who save the day by cobbling together data products from whatever raw material they have around.

The data products my team creates are not important enough to justify huge enterprise-wide infrastructures. It’s just not worth it to invest in hyper-efficient automation and production control. On the other hand, our data products influence important decisions in the enterprise, and it’s important that our efforts scale. We can’t afford to do things manually all the time, and we need efficient ways of sharing results with tens of thousands of people.

There are a lot of us out there — the “regular” data scientists; we’re more organized than hackers but with no need for a superhero-style data science lair. A group of us met and held a speed ideation event, where we brainstormed on the best practices we need to write solid code. This article is a summary of the conversation and an attempt to collect our knowledge, distill it, and present it in one place. Read more…


Bluetooth LE has solved the 50% problem, cracking open the IoT

The O'Reilly Radar Podcast: Alasdair Allan on BLE, data from the Pluto flyby, and the future of "personal space programs."

Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.

In this week’s O’Reilly Radar Podcast, O’Reilly’s Mac Slocum chats with Alasdair Allan, an astrophysicist and director at Babilim Light Industries. In their wide-ranging conversation, Allan talks about the data coming out of the New Horizons Pluto flyby, the future of “personal space programs,” and why Bluetooth LE (BLE) is cracking open the Internet of Things.

Here are a few highlights from their conversation:

The only thing Bluetooth LE shares with traditional Bluetooth is the name.

Bluetooth LE, now that Google Android also supports it, has solved the 50% problem. … Now that all of the smartphones in the world have Bluetooth LE, or at least the more modern ones, there is a very easy way to produce low-power devices — wearables, embedded sensors, all of that sort of stuff — that anyone can access with a smartphone.
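The conversation stays at the conceptual level, but as a rough editorial illustration of how little code BLE device discovery now takes, here is a minimal sketch using the third-party bleak Python library (my choice for illustration; it is not something Allan mentions in the episode):

    import asyncio

    from bleak import BleakScanner  # third-party BLE library: pip install bleak

    async def main():
        # Listen for nearby BLE advertisements for a few seconds and print them.
        devices = await BleakScanner.discover(timeout=5.0)
        for device in devices:
            print(device.address, device.name)

    if __name__ == "__main__":
        asyncio.run(main())

Running this on any laptop or single-board computer with a BLE radio lists nearby advertising devices, which is the kind of low-friction access the quote above describes.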

The Internet of Things is neither about the Internet, nor really the things. I much prefer the academic term “ubiquitous computing,” but no one really seems to want to use that, which is somewhat unfortunate.

You don’t have to worry about power, and that can be a real lever to open up the wearables market in the same way that BLE was a lever to open the IoT market.

There are hardly any impact craters on the surface of Pluto, so that means that the surface itself is active. … Also, there’s these huge mountain ranges, three-and-a-half-thousand meters tall — and they’re pointy. There is no way the mountains on Pluto should be pointy.

Read more…


Understanding neural function and virtual reality

The O'Reilly Data Show Podcast: Poppy Crum explains that what matters is efficiency in identifying and emphasizing relevant data.

Like many data scientists, I’m excited about advances in large-scale machine learning, particularly recent success stories in computer vision and speech recognition. But I’m also cognizant of the fact that press coverage tends to inflate what current systems can do, and their similarities to how the brain works.

During the latest episode of the O’Reilly Data Show Podcast, I had a chance to speak with Poppy Crum, a neuroscientist who gave a well-received keynote at Strata + Hadoop World in San Jose. She leads a research group at Dolby Labs and teaches a popular course at Stanford on Neuroplasticity in Musical Gaming. I wanted to get her take on AI and virtual reality systems, and hear about her experience building a team of researchers from diverse disciplines.

Understanding neural function

While it can sometimes be nice to mimic nature, in the case of the brain, machine learning researchers recognize that understanding and identifying the essential neural processes is much more critical. A related example cited by machine learning researchers is flight: wing flapping and feathers aren’t critical, but an understanding of physics and aerodynamics is essential.

Crum and other neuroscience researchers express the same sentiment. She points out that a more meaningful goal should be to “extract and integrate relevant neural processing strategies when applicable, but also identify where there may be opportunities to be more efficient.”

The goal in technology shouldn’t be to build algorithms that mimic neural function. Rather, it’s to understand neural function. … The brain is basically, in many cases, a Rube Goldberg machine. We’ve got this limited set of evolutionary building blocks that we are able to use to get to a sort of very complex end state. We need to be able to extract when that’s relevant and integrate relevant neural processing strategies when it’s applicable. We also want to be able to identify that there are opportunities to be more efficient and more relevant. I think of it as table manners. You have to know all the rules before you can break them. That’s the big difference between being really cool or being a complete heathen. The same thing kind of exists in this area. How we get to the end state, we may be able to compromise, but we absolutely need to be thinking about what matters in neural function for perception. From my world, where we can’t compromise is on the output. I really feel like we need a lot more work in this area. Read more…


Avoid design pitfalls in the IoT: Keep the focus on people

The O'Reilly Radar Podcast: Robert Brunner on IoT pitfalls, Ammunition, and the movement toward automation.

Subscribe to the O’Reilly Radar Podcast to track the technologies and people that will shape our world in the years to come.

For this week’s Radar Podcast, I had the opportunity to sit down with Robert Brunner, founder of the Ammunition design studio. Brunner talked about how design can help mitigate IoT pitfalls, what drove him to found Ammunition, and why he’s fascinated with design’s role in the movement toward automation.

Here are a few of the highlights from our chat:

One of the biggest pitfalls I’m seeing in how companies are approaching the Internet of Things, especially in the consumer market, is, literally, not paying attention to people — how people understand products and how they interact with them and what they mean to them.

It was this broader experience and understanding of what [a product] is and what it does in people’s lives, and what it means to them — that’s experienced not just through the thing, but how they learn about it, how they buy it, what happens when they open up the box, what happens when they use the product, what happens when the product breaks; all these things add up to how you feel about it and, ultimately, how you relate to a company. That was the foundation of [Ammunition].

Ultimately, I define design as the purposeful creation of things.

Read more…
