"The O’Reilly Data Show Podcast" entries

Jai Ranganathan on architecting big data applications in the cloud

The O’Reilly Data Show podcast: The Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

350px_Handy_notes_and_queries_(1887)_(14760632446)

Given the quick pace of innovation in the data ecosystem, we like to take a step back from the details of individual components, architecture, and applications, in order to take a wider view of the landscape of big data. This allows us to evaluate the progress of technology and infrastructure along the way, shifting our attention from the details of individual components like Spark and Kafka, to larger trends.

Some of the larger trends we’ve been exploring include the capabilities of distributed machine learning and the tradeoffs and design decisions involved in cloud architecture and stream processing.

In this episode of the O’Reilly Data Show, I sat down with Jai Ranganathan, senior director of product management at Cloudera. We talked about the trends in the Hadoop ecosystem, cloud computing, the recent surge in interest in all things real time, and hardware trends:

Large-scale machine learning

This sounds a bit like this should already exist in really good form right now, but one of the things that I’m really interested in is expanding the set of capabilities for distributed machine learning. While there are systems out there today that do do this, I think relative to what you can experience from a singular environment learning scikit-learn or R, the set of things you can do in a distributed fashion is limited. …  It’s not easy to distribute various algorithms and model-building techniques. I think there is still a lot of work for us to do to improve that experience. … And I do want to have good open source options like MLlib. MLlib may be the right answer. I would be perfectly happy if that’s the final answer, but we do need systems just to provide the kind of depth that you typically are used to in the singular environment. That’s just a matter of time and investment because these are non-trivial problems, but they are things that people are working on.

Read more…

Building enterprise data applications with open source components

The O’Reilly Data Show podcast: Dean Wampler on bounded and unbounded data processing and analytics.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

350px_Julia_set_of_a_hyperbolic_exponential_map_with_an_attracting_orbit_of_period_3

I first found myself having to learn Scala when I started using Spark (version 0.5). Prior to Spark, I’d peruse books on Scala but just never found an excuse to delve into it. In the early days of Spark, Scala was a necessity — I quickly came to appreciate it and have continued to use it enthusiastically.

For this Data Show Podcast, I spoke with O’Reilly author and Typesafe’s resident big data architect Dean Wampler about Scala and other programming languages, the big data ecosystem, and his recent interest in real-time applications. Dean has years of experience helping companies with large software projects, and over the last several years, he’s focused primarily on helping enterprises design and build big data applications.

Here are a few snippets from our conversation:

Apache Mesos & the big data ecosystem

It’s a very nice capability [of Spark] that you can actually run it on a laptop when you’re developing or working with smaller data sets. … But, of course, the real interesting part is to run on a cluster. You need some cluster infrastructure and, fortunately, it works very nicely with YARN. It works very nicely on the Hadoop ecosystem. … The nice thing about Mesos over YARN is that it’s a much more flexible, capable resource manager. It basically treats your cluster as one giant machine of resources and gives you that illusion, ignoring things like network latencies and stuff. You’re just working with a giant machine and it allocates resources to your jobs, multiple users, all that stuff, but because of its greater flexibility, it cannot only run things like Spark jobs, it can run services like HDFS or Cassandra or Kafka or any of these tools. … What I saw was there was a situation here where we had maybe a successor to YARN. It’s obviously not as mature an ecosystem as the Hadoop ecosystem but not everybody needs that maturity. Some people would rather have the flexibility of Mesos or of solving more focused problems.

Read more…

From search to distributed computing to large-scale information extraction

The O'Reilly Data Show Podcast: Mike Cafarella on the early days of Hadoop/HBase and progress in structured data extraction.

Subscribe to the O’Reilly Data Show Podcast to explore the opportunities and techniques driving big data and data science.

The_Wonders_of_The_World_British_Library_FlickrFebruary 2016 marks the 10th anniversary of Hadoop — at a point in time when many IT organizations actively use Hadoop, and/or one of the open source, big data projects that originated after, and in some cases, depend on it.

During the latest episode of the O’Reilly Data Show Podcast, I had an extended conversation with Mike Cafarella, assistant professor of computer science at the University of Michigan. Along with Strata + Hadoop World program chair Doug Cutting, Cafarella is the co-founder of both Hadoop and Nutch. In addition, Cafarella was the first contributor to HBase

We talked about the origins of Nutch, Hadoop (HDFS, MapReduce), HBase, and his decision to pursue an academic career and step away from these projects. Cafarella’s pioneering contributions to open source search and distributed systems fits neatly with his work in information extraction. We discussed a new startup he recently co-founded, ClearCutAnalytics, to commercialize a highly regarded academic project for structured data extraction (full disclosure: I’m an advisor to ClearCutAnalytics). As I noted in a previous post, information extraction (from a variety of data types and sources) is an exciting area that will lead to the discovery of new features (i.e., variables) that may end up improving many existing machine learning systems. Read more…