O'Reilly Strata

An Introduction to Hadoop 2.0: Understanding the New Data Operating System

Sneak peek at an upcoming tutorial at Strata Santa Clara 2014

By Rich Raposa

Apache Hadoop 2.0 represents a generational shift in the architecture of Apache Hadoop. With YARN, Apache Hadoop is recast as a significantly more powerful platform – one that takes Hadoop beyond batch applications and positions it as a ‘data operating system’ in which HDFS is the file system and YARN is the operating system.

YARN is a re-architecture of Hadoop that allows multiple applications to run on the same platform. With YARN, applications run “in” Hadoop, instead of “on” Hadoop.


Read more…

Data Transformation

Skills of the Agile Data Wrangler

By Joe Hellerstein and Jeff Heer

As data processing has become more sophisticated, there has been little progress on improving the most time-consuming and tedious parts of the pipeline: Data Transformation tasks including discovery, structuring, and content cleaning. In standard practice, this kind of “data wrangling” requires writing idiosyncratic scripts in programming languages such as Python or R, or extensive manual editing using interactive tools such as Microsoft Excel. This has two significant negative consequences. First, people with highly specialized skills (e.g., statistics, molecular biology, micro-economics) spend far more time on tedious data wrangling tasks than on exercising their specialty. Second, less technical users are often unable to wrangle their own data. The result in both cases is that significant data is often left unused because of the hurdle of transforming it into shape. Sadly, when it comes to standard practice in modern data analysis, “the tedium is the message.” In our upcoming tutorial at Strata, we will survey both sources of and solutions to the problems of Data Transformation.

Analysts must regularly transform data to make it palatable to databases, statistics packages, and visualization tools. Data sets also regularly contain missing, extreme, duplicate or erroneous values that can undermine the results of analysis. These anomalies come from various sources, including human data entry error, inconsistencies between integrated data sets, and sensor interference. Our own interviews with data analysts have found that these types of transforms constitute the most tedious component of their analytic process. Flawed analyses due to dirty data are estimated to cost billions of dollars each year. Discovering and correcting data quality issues can also be costly: by some estimates, cleaning dirty data accounts for 80 percent of the cost of data warehousing projects.
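To make the anomaly types above concrete, here is a minimal Python sketch that flags missing, duplicate, and extreme values in a small set of records. It is an illustration only, not the authors' tooling: the function name find_anomalies, the sample sensor readings, and the choice of a median-based (modified z-score) outlier test with a 3.5 cutoff are all assumptions made for this example.

```python
from statistics import median


def find_anomalies(rows, field, threshold=3.5):
    """Return row indexes with missing, duplicate, or extreme values.

    Hypothetical helper for illustration. "Extreme" uses the modified
    z-score (based on the median absolute deviation), which is one
    common choice among many.
    """
    report = {"missing": [], "duplicates": [], "extreme": []}

    seen = set()
    for i, row in enumerate(rows):
        # A row is a duplicate if an identical row appeared earlier.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicates"].append(i)
        seen.add(key)
        if row.get(field) is None:
            report["missing"].append(i)

    present = [row[field] for row in rows if row.get(field) is not None]
    if not present:
        return report

    med = median(present)
    mad = median(abs(v - med) for v in present)
    for i, row in enumerate(rows):
        value = row.get(field)
        if value is None or mad == 0:
            continue
        if 0.6745 * abs(value - med) / mad > threshold:
            report["extreme"].append(i)
    return report


# Made-up sample data: one missing reading, one repeated row, one spike.
rows = [
    {"sensor": "a", "temp": 20.1},
    {"sensor": "b", "temp": 19.8},
    {"sensor": "c", "temp": None},
    {"sensor": "b", "temp": 19.8},
    {"sensor": "d", "temp": 20.4},
    {"sensor": "e", "temp": 250.0},
    {"sensor": "f", "temp": 20.0},
]

print(find_anomalies(rows, "temp"))
```

Even this toy version shows why the work is tedious: each data set needs its own notion of what counts as a duplicate or an extreme value, and flagging a row is only the first step toward deciding how to repair it.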

Read more…

Interactive Visualization of Big Data

By Jeffrey Heer

Human judgment is at the center of successful data analysis. This statement might initially seem at odds with the current Big Data frenzy and its focus on data management and machine learning methods. But while these tools provide immense value, it is important to remember that they are just that: tools. A hammer does not a carpenter make — though it certainly helps.

Consider the words of John Tukey [1], possibly the greatest statistician of the last half-century: “Nothing – not the careful logic of mathematics, not statistical models and theories, not the awesome arithmetic power of modern computers – nothing can substitute here for the flexibility of the informed human mind. Accordingly, both approaches and techniques need to be structured so as to facilitate human involvement and intervention.” Tukey goes on to write: “Some implications for effective data analysis are: (1) that it is essential to have convenience of interaction of people and intermediate results and (2) that at all stages of data analysis the nature and detail of output need to be matched to the capabilities of the people who use it and want it.” Though Tukey and colleagues voiced these sentiments nearly 50 years ago, they ring even more true today. The interested analyst is at the heart of the Big Data question: how well do our tools help users ask better questions, formulate hypotheses, spot anomalies, correct errors and create improved models and visualizations? To “facilitate human involvement” across “all stages of data analysis” is a grand challenge for our age.

Read more…

Design, Math, and Data

Lessons from the design community for developing data-driven applications

By Dean Malmgren

When someone says, “that is a nice infographic” or “check out this sweet dashboard,” many people infer that the work is “well-designed.” Creating accessible (or for the cynical, “pretty”) content is only part of what makes good design powerful. The design process is geared toward solving specific problems. This process has been formalized in many ways (e.g., IDEO’s Human Centered Design, Marc Hassenzahl’s User Experience Design, or Braden Kowitz’s Story-Centered Design), but the basic idea is that you have to explore the breadth of the possible before you can isolate truly innovative ideas. We at Datascope Analytics argue that the same is true of designing effective data science tools, dashboards, engines, etc.: in order to design effective dashboards, you must know what is possible.

Read more…

Cloudera Impala: Bringing the SQL and Hadoop Worlds Together

By John Russell


When I came to work on the Cloudera Impala project, I found many things that were familiar from my previous experience with relational databases, UNIX systems, and the open source world. Yet other aspects were all new to me. I know from documenting both enterprise software and open source projects that it’s a special challenge when those two aspects converge. A lot of new users come in with 95% of the information they need, but they don’t know where the missing or outdated 5% is. One mistaken assumption or unfamiliar buzzword can make someone feel like a complete beginner. That’s why I was happy to have the opportunity to write this overview article, with room to explore how users from all kinds of backgrounds can understand and start using the Cloudera Impala product.

For database users, the Apache Hadoop ecosystem can feel like a new world:

  • Sysadmins don’t bat an eye when you say you want to work on terabytes or petabytes of data.
  • A networked cluster of machines isn’t a complicated or scary proposition. Instead, it’s the standard environment you ask an intern to set up on their first day as a training exercise.
  • All the related open source projects aren’t an either-or proposition. You work with a dozen components that all interoperate, stringing them together like a UNIX toolchain.

Read more…

A Patient a Day Keeps the Doctor in Play

By Julie Yoo, Chief Product Officer at Kyruus

Once upon a time, a world-renowned surgeon, Dr. Michael DeBakey, was summoned by the President when the Shah of Iran, a figure of political and strategic importance, fell ill with an enlarged spleen due to cancer. Dr. DeBakey was whisked away to Egypt to meet the Shah, made a swift diagnosis, and recommended an immediate operation to remove the spleen. The surgery lasted 80 minutes; the spleen, which had grown to 10 times its normal size, was removed, and the Shah made a positive recovery in the days following the surgery – that is, until he took a turn for the worse, and ultimately died from surgical complications a few weeks later. [1]

Sounds like a routine surgery gone awry, yes? But consider this: Dr. DeBakey was a cardiovascular surgeon – in other words, a surgeon who specialized in operating on the heart and blood vessels, not the spleen. He was best known for his open heart bypass surgery techniques, and the vast majority of his peer-reviewed articles concern cardiac operating techniques. High profile or not, why was a cardiovascular surgeon selected to perform an abdominal surgery?

Read more…

Machine Learning for Human Rights

Data Science for Social Good fellows partner with Ushahidi

By Rob Mitchum

“2-car acc @ State & Lake, both drivers injred”

That short, hastily typed text message or tweet contains a lot of information that police, emergency responders, news organizations and drivers could use. A human observer could quickly identify that it refers to an auto accident, a medical emergency, and a street intersection in Chicago. But without prior experience and lots of human input, a computer would likely have a hard time recognizing that State and Lake are streets in Chicago, that “acc” is short for accident, or that “injred” is a typo for “injured.”
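As a rough illustration of the “human input” such a system needs, the Python sketch below expands abbreviations from a hand-built table and repairs typos by fuzzy-matching against a small vocabulary, using the standard library's difflib. The ABBREVIATIONS table, the VOCABULARY list, and the normalize function are hypothetical and written only for this example; they are not the fellows' method, and recognizing that State and Lake are Chicago streets would need a separate gazetteer that is not shown here.

```python
import difflib
import re

# Hypothetical, hand-built lookup tables (the "human input").
ABBREVIATIONS = {"acc": "accident", "@": "at", "&": "and"}
VOCABULARY = ["accident", "both", "car", "drivers", "injured", "lake", "state"]


def normalize(message):
    """Expand known abbreviations and repair near-miss spellings."""
    # Keep words (including hyphenated ones) plus the symbols @ and &.
    tokens = re.findall(r"[\w-]+|[@&]", message.lower())
    cleaned = []
    for token in tokens:
        if token in ABBREVIATIONS:
            cleaned.append(ABBREVIATIONS[token])
        elif token in VOCABULARY:
            cleaned.append(token)
        else:
            # Accept only close matches; otherwise leave the token as typed.
            match = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
            cleaned.append(match[0] if match else token)
    return " ".join(cleaned)


print(normalize("2-car acc @ State & Lake, both drivers injred"))
```

On the sample message this yields “2-car accident at state and lake both drivers injured.” The tables are the weak point: every new abbreviation or place name has to be added by hand, which is the gap that learned models aim to close.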

Computer science offers machine learning and natural language processing techniques that can make sense of messy and disorganized text. Those techniques are at the heart of one of the summer projects of the Data Science for Social Good fellowship. (The fellowship is a University of Chicago program funded by Google’s Eric Schmidt and run by former Obama campaign chief data scientist Rayid Ghani, now at the Computation Institute. To learn more about the fellowship check out the website or read this previous post in the series.) Working with the non-profit organization Ushahidi, a team of three fellows hopes to accelerate the processing of incoming messages during disasters, contested elections and other crises to quickly spread information and mobilize responses.

Read more…

The promise of Big Data to the CMO

A game changer for a marketer to pinpoint what a customer wants, when they want it, and how they want to hear about it

By Michael Gold, Farsite

My two-and-a-half-year-old daughter loves the Mickey Mouse Clubhouse. She watches episodes on TV and on our iPad. She wears Minnie Mouse flip flops and giggles just about every time she sees anything with Mickey, Daisy, Goofy…you get the idea. And when she’s old enough to go to Disney World, Minnie might walk right up to her and say “Hi Jemma!” and give her a big hug.

Creating a personal interaction between a child and a beloved Disney character exemplifies the company’s recent initiative to deliver a personalized, hassle-free experience at its theme parks [1]. With the wireless tracking wristband ‘MagicBand,’ families are able to reserve spots in lines for popular attractions, purchase items at the parks, and unlock their hotel rooms. The MagicBand is part of the MyMagic+ system, which enables Disney to collect data on visitors’ purchasing habits and real-time location, among other things. Disney will use this vast trove of information to deliver a personalized experience at the parks and tailor marketing messages and promotions.

Read more…

Training Aspiring Data Scientists in Problem Solving

Chicago-Based Data Science for Social Good Fellows Focus on Problem Solving


By Juan-Pablo Velez

Data science isn’t just about creating algorithms, writing code, or visualizing data. The first step is finding the right problem to solve.

Many of the governments and nonprofit organizations we’ve talked to while developing the Data Science for Social Good fellowship at the University of Chicago are excited about using data to make better decisions. (The fellowship is funded by Google’s Eric Schmidt and run by former Obama campaign chief data scientist Rayid Ghani, now at the University of Chicago’s Computation Institute. To learn more about the fellowship check out the website or read this previous post in the series.) But most aren’t quite sure where to start, while others pitch lots of problems that are initially too vague to solve with data. To help these organizations grow their impact, data scientists must be hands-on. They need to quickly learn the ins and outs of unfamiliar fields, from health care to energy to municipal government. They need to understand what data is available both inside and outside an organization, and they need a knack for distilling ill-defined problems into clear and tractable ones.

Read more…

Data Science for Social Good: A Fellowship

Training Aspiring Data Scientists in Chicago


By Juan-Pablo Velez

The Fellowship

As technology penetrates further into everyday life, we’re creating lots of data. Businesses are scrambling to find data scientists to make sense of all this data and turn it into better decisions.

Businesses aren’t alone. Data science could transform how governments and nonprofits tackle society’s problems. The problem is, most governments and nonprofits simply don’t know what’s possible yet. There are too few data scientists out there and too many spending their days optimizing ads instead of bettering lives. To make real impact with data, we need to work on high-impact projects that show these organizations the power of analytics. And we need to expose data scientists to the problems that really matter.

That’s exactly why we’re doing the Eric and Wendy Schmidt Data Science for Social Good summer fellowship at the University of Chicago. The program is led by Rayid Ghani, former chief data scientist for the 2012 Obama campaign, and is funded by Google Chairman Eric Schmidt.

We’ve brought three dozen aspiring data scientists from all over the world to Chicago to spend a summer working on data science projects with social impact. The fellows are working closely with governments and nonprofits (including the City of Chicago, the Chicago Transit Authority, and the Nurse-Family Partnership) to take on real-world problems in education, health, energy, transportation, and more. (To read up on our projects, check out dssg.io/projects; to get involved, go to github.com/dssg.)

Lots of folks have been asking about how we’re training data scientists.

Data scientists are a hybrid group with computer science, statistics, machine learning, data mining, and database skills. These skills take years to learn and there’s no way to teach all of them in a few weeks. Instead of starting from scratch, we decided to start with students in computational and quantitative fields – folks who already have some of these skills and use them daily in an academic setting. And we gave them the opportunity to apply their abilities to solve real-world problems and to pick up the skills they’re missing.

Read more…