"Hadoop" entries

Navigating the Hadoop ecosystem

A field guide to the Apache Hadoop projects, subprojects, and related technologies.

Marshall Presser, co-author of Field Guide to Hadoop, contributed to this post.

IT managers, developers, data analysts, and system architects are encountering the largest and most disruptive change in data analysis since the ascendency of the relational database in early 1980s — the challenge to process, organize, and take full advantage of big data.  With 73% of organizations making big data investments in 2014 and 2015, this transition is occurring at a historic pace, requiring new ways of thinking to go along with new tools and techniques.

Hadoop is the cornerstone of this change to a landscape of systems and skills we’ve traditionally possessed. In the nine short years since the project revolutionized data science at Yahoo!, an entire ecosystem of technologies has sprung up around it. While the power of this ecosystem is plain to see, it can be a challenge to navigate your way through the complex and rapidly evolving collection of projects and products.

A couple years ago, my coworker Marshall Presser and I started our journey into the world of Hadoop. Like many folks, we found the company we worked for was making a major investment in the Hadoop ecosystem, and we had to find a way to adapt. We started in all of the typical places — blog posts, trade publications, Wikipedia articles, and project documentation. Quickly, we learned that many of these sources are often highly biased, either too shallow or too deep, and just plain inconsistent. Read more…

Comment: 1
Four short links: 16 February 2015

Four short links: 16 February 2015

Grace Hopper, Car Dashboard UIs, DAO Governance, and Sahale.

  1. The Queen of Code — 12m documentary on Grace Hopper, produced by fivethirtyeight.com.
  2. Car Dashboard UI Collection — inspiration board for your (data) dashboards.
  3. Subjectivity-Exploitability TradeoffVoting-based DAOs, lacking an equivalent of shareholder regulation, are vulnerable to attacks where 51% of participants collude to take all of the DAO’s assets for themselves […] The example supplied here will define a new, third, hypothetical form of blockchain or DAO governance. Every day we’re closer to Stross’s Accelerando.
  4. Sahale — open source cascading workflow visualizer to help you make sense of tasks decomposed into Hadoop jobs. (via Code as Craft)

A tale of two clusters: Mesos and YARN

With Myriad, analytics can be performed on the same hardware that runs your production services.

This is a tale of two siloed clusters. The first cluster is an Apache Hadoop cluster. This is an island whose resources are completely isolated to Hadoop and its processes. The second cluster is the description I give to all resources that are not a part of the Hadoop cluster. I break them up this way because Hadoop manages its own resources with Apache YARN (Yet Another Resource Negotiator). Which is nice for Hadoop, but all too often those resources are underutilized when there are no big data workloads in the queue. And then when a big data job comes in, those resources are stretched to the limit, and they are likely in need of more resources. That can be tough when you are on an island.


Isolated clusters. Source: Mesosphere and MapR, used with permission.

Hadoop was meant to tear down walls — albeit, data silo walls — but walls, nonetheless. What has happened is that while tearing some walls down, other types of walls have gone up in their place.

Another technology, Apache Mesos, is also meant to tear down walls — but Mesos has often been positioned to manage the “second cluster,” which are all of those other, non-Hadoop workloads.

This is where the story really starts, with these two silos of Mesos and YARN. They are often pitted against each other, as if they were incompatible. It turns out they work together, and therein lies my tale. Read more…


A million rows isn’t cool. You know what’s cool? A billion rows.

Changing your frame of reference when starting with SQL on Hadoop.

Editor’s note: John Russell will be one of the teachers of the tutorial Getting Started with Interactive SQL-On-Hadoop at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.

If you’re just getting started doing analytic work with SQL on Hadoop, a table with a million rows might seem like a good starting point for experimentation. Isn’t that a lot of data? While you can exercise the features of a traditional database with a million rows, for Hadoop it’s not nearly enough. Think billions of rows instead.

Let’s look at the ways a million-row table falls short. Understanding the data volumes involved with big data can help you avoid going down unproductive pathways based on misleading assumptions.

With a million-row table, every byte in each row represents a megabyte of total data volume. Let’s say your table represents people and has fields for name, address, occupation, salary, height, weight, number of children, and favorite food. Here’s what a sample field might look like, with a scale underneath to illustrate length:


This particular record takes up 78 characters, including the comma separators. A back-of-the-envelope calculation suggests that, if this is an average row, we’ll end up with about 78 megabytes of data in the table. (And don’t recycle that envelope just yet — doing analytics with Hadoop, you’ll do a lot of rough estimates like this to sanity-check your expectations about performance and scalability.) Read more…

Four short links: 15 January 2015

Four short links: 15 January 2015

Secure Docker Deployment, Devops Identity, Graph Processing, and Hadoop Alternative

  1. Docker Secure Deployment Guidelinesdeployment checklist for securely deploying Docker.
  2. The Devops Identity Crisis (Baron Schwartz) — I saw one framework-retailing bozo saying that devops was the art of ensuring there were no flaws in software. I didn’t know whether to cry or keep firing until the gun clicked.
  3. Apache Giraphan iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections.
  4. Apache Flinka data processing system and an alternative to Hadoop’s MapReduce component. It comes with its own runtime, rather than building on top of MapReduce. As such, it can work completely independently of the Hadoop ecosystem. However, Flink can also access Hadoop’s distributed file system (HDFS) to read and write data, and Hadoop’s next-generation resource manager (YARN) to provision cluster resources. Since most Flink users are using Hadoop HDFS to store their data, we ship already the required libraries to access HDFS.

The data lake model is a powerhouse for invention

In this O'Reilly Radar Podcast: Edd Dumbill on the data lake, and Rajiv Maheswaran on the science of moving dots.

In a recent blog post, Edd Dumbill, VP of strategy at Silicon Valley Data Science, wrote about the phrase “data lake.” Likening it to a dream, he described a data lake as “a place with data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment…Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise.” He explained that he called it a “dream” because “we’ve a way to go to make the vision come true” — but noted he’s optimistic the dream can be realized.

Subscribe to the O’Reilly Radar Podcast

iTunes, SoundCloud, RSS

In this Radar Podcast epidsode, O’Reilly’s Mac Slocum sits down with Dumbill to talk about the data lake, the opportunities the model presents, and the driving forces behind the concept. Read more…


Signals from Strata + Hadoop World New York 2014

From unique data applications to factories of the future, here are key insights from Strata + Hadoop World New York 2014.

Experts from across the data world came together in New York City for Strata + Hadoop World New York 2014. Below we’ve assembled notable keynotes, interviews, and insights from the event.

Unusual data applications and the correct way to say “Hadoop”

Hadoop creator and Cloudera chief architect Doug Cutting discusses surprising data applications — from dating sites to premature babies — and he reveals the proper (but in no way required) pronunciation of “Hadoop.”

Read more…


The human side of Hadoop

Doug Cutting on applications of Hadoop, where "Hadoop" comes from, and the new partnership between Cloudera and O'Reilly.

Roger Magoulas, director of market research at O’Reilly and Strata co-chair, recently sat down with Doug Cutting, chief architect at Cloudera, to talk about the new partnership between Cloudera and O’Reilly, and the state of the Hadoop landscape.

Cutting shares interesting applications of Hadoop, several of which had touching human elements. For instance, he tells a story about visiting Children’s Healthcare of Atlanta and discovering the staff using Hadoop to reduce stress in babies. Read more…

Four short links: 5 August 2014

Four short links: 5 August 2014

Discussion Graph Tool, Superlinear Productivity, Go Concurrency, and R Map/Reduce Tools

  1. Discussion Graph Tool (Microsoft Research) — simplifies social media analysis by making it easy to extract high-level features and co-occurrence relationships from raw data.
  2. Superlinear Productivity in Collective Group Actions (PLoS ONE) — study of open source projects shows small groups exhibit non-linear productivity increases by size, which drop off at larger sizes. we document a size effect in the strength and variability of the superlinear effect, with smaller groups exhibiting widely distributed superlinear exponents, some of them characterizing highly productive teams. In contrast, large groups tend to have a smaller superlinearity and less variability.
  3. coop — cheat sheet of the most common concurrency program flows in Go.
  4. Tessera — set of open source tools around Hadoop, R, and visualization.
Four short links: 23 May 2014

Four short links: 23 May 2014

Educate Users, Hardware by the Numbers, Humans Beating Computers, Hadoop's Uncomfortable Fit

  1. How to Educate Users (Luke Wroblewski) — help new users in your app, not in a video.
  2. Hardware By The Numbers (Renee DiResta) — slides from her keynote at the Solid conference. The mean success rate across all sectors is 19.8%. On average, only 10% of hardware startups raise a second round.
  3. Humans Beating Computers (Wired) — Newman assembled a small team that became known as the “Air Divers”–the people who would dive deep into the individual complaints and surface with answers. Each was given a couple hundred support tickets connected to a specific issue that the data had identified as a hot-button topic. They would go off and read through each one, then come back and propose a fix. And in the end, this is what turned the situation around. Sometimes it’s easier to put people on the job than try to code the data analysis.
  4. Hadoop’s Uncomfortable Fit in HPCHadoop is being taken seriously only at a subset of supercomputing facilities in the US, and at a finer granularity, only by a subset of professionals within the HPC community.