Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.
I first found myself having to learn Scala when I started using Spark (version 0.5). Prior to Spark, I’d peruse books on Scala but just never found an excuse to delve into it. In the early days of Spark, Scala was a necessity — I quickly came to appreciate it and have continued to use it enthusiastically.
For this Data Show Podcast, I spoke with O’Reilly author and Typesafe’s resident big data architect Dean Wampler about Scala and other programming languages, the big data ecosystem, and his recent interest in real-time applications. Dean has years of experience helping companies with large software projects, and over the last several years, he’s focused primarily on helping enterprises design and build big data applications.
Here are a few snippets from our conversation:
Apache Mesos & the big data ecosystem
It’s a very nice capability [of Spark] that you can actually run it on a laptop when you’re developing or working with smaller data sets. … But, of course, the real interesting part is to run on a cluster. You need some cluster infrastructure and, fortunately, it works very nicely with YARN. It works very nicely on the Hadoop ecosystem. … The nice thing about Mesos over YARN is that it’s a much more flexible, capable resource manager. It basically treats your cluster as one giant machine of resources and gives you that illusion, ignoring things like network latencies and stuff. You’re just working with a giant machine and it allocates resources to your jobs, multiple users, all that stuff, but because of its greater flexibility, it cannot only run things like Spark jobs, it can run services like HDFS or Cassandra or Kafka or any of these tools. … What I saw was there was a situation here where we had maybe a successor to YARN. It’s obviously not as mature an ecosystem as the Hadoop ecosystem but not everybody needs that maturity. Some people would rather have the flexibility of Mesos or of solving more focused problems.
I think it’s still early days, but I think the potential is there. In a way, it’s analogous to Spark in that it starts with some really good fundamental ideas and then builds on them. In Spark’s case, it would be the so-called resilient distributed data sets. With Tachyon, it’s basically an in-memory distributed file system — or a way to think of it, it’s like a distributed cache with file system semantics. What’s attractive about that is that you can basically have multiple applications accessing the same data sets and memory, accessing them through a file system, kind of API or a more proprietary API, but you get in-memory speeds with some configuration to do some durability. Behind the scenes, obviously you don’t want that data to get lost if the machine goes down. There’s facilities for having the data be backed to a file system, so I think it solves a number of interesting problems in big data applications like sharing data between running jobs, like giving you much more flexibility and performance characteristics. I think it’s pretty exciting.
Backpressure and reactive streams
Backpressure would be signaling from the consumer back to the producer, ‘Hey, I can’t take as much data as you’re feeding me, or I can actually take more data.’ It’s this protocol for controlling the rate of flow. And the reason this is important is because a classic way of implementing a connection between a producer and consumer is to put a buffer, like a queue in between them but then you have this dilemma. You could make it unbounded so that you never fill it up but the problem is memory is finite. Inevitably when you think about what’s going to happen in a stream system that runs for years, some weird situation … where the producer will just keep feeding data too fast to a consumer and it’ll eventually run out of memory and crash. You don’t like that but the flipside is, all right, make these things bounded buffers but you still haven’t completely solved your problems because then what do you do when that fills up? You end up arbitrarily dropping data or doing some other thing.
The idea with backpressure is, just have a negotiation happen out of band, like separate socket connection or something, where when the consumer can keep up, it’s just a push model. I just keep pushing data, but if the consumer gets backed up, then the consumer can signal, ‘All right, send me five more or send me 10 more,’ or whatever, that kind of thing, until [it] gets caught up.
[Reactive streams] is a standard for that backpressure mechanism [so that] if you get a directed graph of these things, then you can make strategic decisions at the beginning. If I’ve got data coming into this system and I’m getting backpressure, at least I can make a strategic decision about what to do.
- Why the data center needs an operating system by Benjamin Hindman, creator of Apache Mesos
- Introduction to Tachyon and a deep dive into Baidu’s production use case: a recent O’Reilly webcast co-presented by Haoyuan Li, the co-creator of Tachyon.
- Showcasing the real-time processing revival: Tools and learning resources for building intelligent, real-time products (sessions at Strata+Hadoop World NYC)
- Apache Spark: Powering applications on-premise and in the cloud, a Data Show episode featuring Spark’s release manager, Patrick Wendell.
Image on article and category pages by L Rempe-Gillen on Wikimedia Commons.