"Hadoop" entries

Signals from Strata + Hadoop World New York 2014

From unique data applications to factories of the future, here are key insights from Strata + Hadoop World New York 2014.

Experts from across the data world came together in New York City for Strata + Hadoop World New York 2014. Below we’ve assembled notable keynotes, interviews, and insights from the event.

Unusual data applications and the correct way to say “Hadoop”

Hadoop creator and Cloudera chief architect Doug Cutting discusses surprising data applications — from dating sites to premature babies — and he reveals the proper (but in no way required) pronunciation of “Hadoop.”

Read more…


The human side of Hadoop

Doug Cutting on applications of Hadoop, where "Hadoop" comes from, and the new partnership between Cloudera and O'Reilly.

Roger Magoulas, director of market research at O’Reilly and Strata co-chair, recently sat down with Doug Cutting, chief architect at Cloudera, to talk about the new partnership between Cloudera and O’Reilly, and the state of the Hadoop landscape.

Cutting shares interesting applications of Hadoop, several of which have touching human elements. For instance, he tells a story about visiting Children’s Healthcare of Atlanta and discovering the staff using Hadoop to reduce stress in babies. Read more…

Four short links: 5 August 2014

Discussion Graph Tool, Superlinear Productivity, Go Concurrency, and R Map/Reduce Tools

  1. Discussion Graph Tool (Microsoft Research) — simplifies social media analysis by making it easy to extract high-level features and co-occurrence relationships from raw data.
  2. Superlinear Productivity in Collective Group Actions (PLoS ONE) — study of open source projects shows small groups exhibit non-linear productivity increases by size, which drop off at larger sizes. We document a size effect in the strength and variability of the superlinear effect, with smaller groups exhibiting widely distributed superlinear exponents, some of them characterizing highly productive teams. In contrast, large groups tend to have a smaller superlinearity and less variability.
  3. coop — cheat sheet of the most common concurrency program flows in Go.
  4. Tessera — set of open source tools around Hadoop, R, and visualization.

Four short links: 23 May 2014

Educate Users, Hardware by the Numbers, Humans Beating Computers, Hadoop's Uncomfortable Fit

  1. How to Educate Users (Luke Wroblewski) — help new users in your app, not in a video.
  2. Hardware By The Numbers (Renee DiResta) — slides from her keynote at the Solid conference. The mean success rate across all sectors is 19.8%. On average, only 10% of hardware startups raise a second round.
  3. Humans Beating Computers (Wired) — Newman assembled a small team that became known as the “Air Divers”–the people who would dive deep into the individual complaints and surface with answers. Each was given a couple hundred support tickets connected to a specific issue that the data had identified as a hot-button topic. They would go off and read through each one, then come back and propose a fix. And in the end, this is what turned the situation around. Sometimes it’s easier to put people on the job than try to code the data analysis.
  4. Hadoop’s Uncomfortable Fit in HPC — Hadoop is being taken seriously only at a subset of supercomputing facilities in the US, and at a finer granularity, only by a subset of professionals within the HPC community.

Big Data systems are making a difference in the fight against cancer

Open source, distributed computing tools speed up an important processing pipeline for genomics data

As open source, big data tools enter the early stages of maturation, data engineers and data scientists will have many opportunities to use them to “work on stuff that matters”. Along those lines, computational biology and medicine are areas where skilled data professionals are already beginning to make an impact. I recently came across a compelling open source project from UC Berkeley’s AMPLab: ADAM is a processing engine and set of formats for genomics data.

Second-generation sequencing machines produce more detailed and thus much larger files for analysis (a 250+ GB file for each person). Existing data formats and tools are optimized for single-server processing and do not easily scale out. ADAM uses distributed computing tools and techniques to speed up key stages of the variant processing pipeline (including sorting and deduping):

[Figure: Variant Calling Pipeline]

Very early on, the designers of ADAM realized that a well-designed data schema (that specifies the representation of data when it is accessed) was key to having a system that could leverage existing big data tools. The ADAM format uses the Apache Avro data serialization system and comes with a human-readable schema that can be accessed using many programming languages (including C/C++/C#, Java/Scala, PHP, Python, Ruby). ADAM also includes a data format/access API implemented on top of Apache Avro and Parquet, and a data transformation API implemented on top of Apache Spark. Because it’s built with widely adopted tools, ADAM users can leverage components of the Hadoop (Impala, Hive, MapReduce) and BDAS (Shark, Spark, GraphX, MLbase) stacks for interactive and advanced analytics.
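To make the schema idea concrete, here is a minimal single-machine sketch in Python. It is not ADAM's code or its actual schema: the record name, the field names, and the helper functions are all hypothetical, and the dedupe rule (keep the read with the highest mapping quality at each position and strand) is a deliberate simplification. It only shows how a human-readable, Avro-style JSON schema can describe reads, and what sorting and deduping those reads might look like before the work is distributed.

```python
import json

# Hypothetical, heavily simplified schema written in Avro's JSON schema style.
# The record name and field names are illustrative, not ADAM's actual schema.
READ_SCHEMA_JSON = """
{
  "type": "record",
  "name": "AlignedRead",
  "fields": [
    {"name": "contig", "type": "string"},
    {"name": "start", "type": "long"},
    {"name": "reverse_strand", "type": "boolean"},
    {"name": "mapping_quality", "type": "int"},
    {"name": "sequence", "type": "string"}
  ]
}
"""


def field_names(schema_json):
    """Return the field names declared in an Avro-style record schema."""
    schema = json.loads(schema_json)
    return [field["name"] for field in schema["fields"]]


def sort_reads(reads):
    """Sort reads by reference position: contig first, then start offset."""
    return sorted(reads, key=lambda r: (r["contig"], r["start"]))


def dedupe_reads(reads):
    """Keep one read per (contig, start, strand), preferring higher quality.

    Simplifying assumption: production duplicate marking uses more evidence
    than position, strand, and a single quality score.
    """
    best = {}
    for read in reads:
        key = (read["contig"], read["start"], read["reverse_strand"])
        if key not in best or read["mapping_quality"] > best[key]["mapping_quality"]:
            best[key] = read
    return sort_reads(best.values())


# Made-up example records that follow the hypothetical schema above.
reads = [
    {"contig": "chr2", "start": 100, "reverse_strand": False,
     "mapping_quality": 30, "sequence": "ACGT"},
    {"contig": "chr1", "start": 500, "reverse_strand": False,
     "mapping_quality": 20, "sequence": "TTGA"},
    {"contig": "chr1", "start": 500, "reverse_strand": False,
     "mapping_quality": 60, "sequence": "TTGA"},
    {"contig": "chr1", "start": 50, "reverse_strand": True,
     "mapping_quality": 40, "sequence": "GGCA"},
]

unique_reads = dedupe_reads(reads)
```

Because the schema is plain JSON, any language with a JSON parser can inspect it, which is the portability point made above. The sort and dedupe steps here run in memory on one machine; the appeal of a system like ADAM is that the same logical steps can be spread across a cluster.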

Read more…


An Introduction to Hadoop 2.0: Understanding the New Data Operating System

Sneak peek at an upcoming tutorial at Strata Santa Clara 2014

By Rich Raposa

Apache Hadoop 2.0 represents a generational shift in the architecture of Apache Hadoop. With YARN, Apache Hadoop is recast as a significantly more powerful platform – one that takes Hadoop beyond batch-only applications and positions it as a ‘data operating system’ where HDFS is the file system and YARN is the operating system.

YARN is a re-architecture of Hadoop that allows multiple applications to run on the same platform. With YARN, applications run “in” Hadoop, instead of “on” Hadoop.

Read more…


How to analyze 100 million images for $624

There's a lot of new ground to be explored in large-scale image processing.

Jetpac is building a modern version of Yelp, using big data rather than user reviews. People are taking more than a billion photos every single day, and many of these are shared publicly on social networks. We analyze these pictures to discover what they can tell us about bars, restaurants, hotels, and other venues around the world — spotting hipster favorites by the number of mustaches, for example.

Treating large numbers of photos as data, rather than just content to display to the user, is a pretty new idea. Traditionally it’s been prohibitively expensive to store and process image data, and not many developers are familiar with both modern big data techniques and computer vision. That meant we had to cut a path through some thick underbrush to get a system working, but the good news is that the free-falling price of commodity servers makes running it incredibly cheap. Read more…


Dealing with Data in the Hadoop Ecosystem

Hadoop, Sqoop, and ZooKeeper

Kathleen Ting (@kate_ting), Technical Account Manager at Cloudera, and our own Andy Oram (@praxagora) sat down to discuss how to work with structured and unstructured data as well as how to keep a system up and running that is crunching that data.

Key highlights include:

  • Misconfigurations account for almost half of the support issues that the team at Cloudera is seeing [Discussed at 0:22]
  • ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
  • Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
  • Sqoop is a bulk data transfer tool [Discussed at 2:47]
  • Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
  • ZooKeeper is not for storage, but for coordination, reliability, and availability [Discussed at 4:44]

You can view the full interview here:

Read more…


Databricks aims to build next-generation analytic tools for Big Data

A new startup will accelerate the maturation of the Berkeley Data Analytics Stack

Key technologists behind the Berkeley Data Analytics Stack (BDAS) have launched a company that will build software – centered around Apache Spark and Shark – for analyzing big data. Details of their product and strategy are sparse, as the company is operating in stealth mode. But through conversations with the founders of Databricks, I’ve learned that they’ll be building general purpose analytic tools that can leverage HDFS, YARN, as well as other components of BDAS.

It will be interesting to see how the team transitions to the corporate world. Their Series A funding round of $14M is being led by Andreessen Horowitz. The board will be composed of Ben Horowitz, Scott Shenker, Matei Zaharia, and Ion Stoica.

Read more…


Stream Processing and Mining just got more interesting

A general purpose stream processing framework from the team behind Kafka and new techniques for computing approximate quantiles

Largely unknown outside data engineering circles, Apache Kafka is one of the more popular open source, distributed computing projects. Many data engineers I speak with either already use it or are planning to do so. It is a distributed message broker used to store and send data streams. Kafka was developed at LinkedIn, where it remains a vital component of their Big Data ecosystem: many critical online and offline data flows rely on feeds supplied by Kafka servers.
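For readers new to the idea of a broker that both stores and sends streams, the toy sketch below may help. It is an in-memory illustration in Python, not Kafka's API: the `TopicLog` and `Consumer` classes and their methods are hypothetical names, and distribution, partitioning, and persistence to disk are all left out. What it does show is the append-only log model: messages stay in the log after they are read, and each consumer tracks its own offset.

```python
class TopicLog:
    """Toy in-memory append-only log (hypothetical; not Kafka's API)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message at the end of the log and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset, without removing them."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Reads from a TopicLog and remembers how far it has read."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        """Fetch the next batch of unread messages and advance the offset."""
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


# Made-up event names for illustration.
log = TopicLog()
for event in ["page_view", "click", "purchase"]:
    log.append(event)

online = Consumer(log)   # e.g. a low-latency consumer
offline = Consumer(log)  # e.g. a batch consumer that reads later

first_batch = online.poll(max_messages=2)
remaining = online.poll()
replayed = offline.poll()
```

Since reading does not delete anything, the online and offline consumers can each work through the same feed at their own pace, which matches the mix of online and offline data flows described above.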

Apache Samza: a distributed stream processing framework

Behind Kafka’s success as an open source project is a team of savvy engineers who have spent the last three years making it a rock solid system. The developers behind Kafka realized early on that it was best to place the bulk of data processing (i.e., stream processing) in another system. Armed with specific use cases, work on Samza proceeded in earnest about a year ago. So while they examined existing streaming frameworks (such as Storm, S4, and Spark Streaming), LinkedIn engineers wanted a system that better fit their needs and requirements:

[Figure: LinkedIn Samza]

Read more…
