"data science" entries

A million rows isn’t cool. You know what’s cool? A billion rows.

Changing your frame of reference when starting with SQL on Hadoop.

Editor’s note: John Russell will be one of the teachers of the tutorial Getting Started with Interactive SQL-On-Hadoop at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.

If you’re just getting started doing analytic work with SQL on Hadoop, a table with a million rows might seem like a good starting point for experimentation. Isn’t that a lot of data? While you can exercise the features of a traditional database with a million rows, for Hadoop it’s not nearly enough. Think billions of rows instead.

Let’s look at the ways a million-row table falls short. Understanding the data volumes involved with big data can help you avoid going down unproductive pathways based on misleading assumptions.

With a million-row table, every byte in each row represents a megabyte of total data volume. Let’s say your table represents people and has fields for name, address, occupation, salary, height, weight, number of children, and favorite food. Here’s what a sample field might look like, with a scale underneath to illustrate length:

[Figure: a sample 78-character record, with a character-count scale beneath it to illustrate its length]

This particular record takes up 78 characters, including the comma separators. A back-of-the-envelope calculation suggests that, if this is an average row, we’ll end up with about 78 megabytes of data in the table. (And don’t recycle that envelope just yet — doing analytics with Hadoop, you’ll do a lot of rough estimates like this to sanity-check your expectations about performance and scalability.) Read more…
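If you want to redo the arithmetic yourself before clicking through, a few lines of Python reproduce the estimate and show how it changes at a billion rows (a minimal sketch; the 78-byte average row comes from the sample record above):

    # Back-of-the-envelope table-size estimate: average row bytes x row count.
    AVG_ROW_BYTES = 78  # the sample record above, comma separators included

    for rows in (10**6, 10**9):
        total_gb = AVG_ROW_BYTES * rows / 10**9
        print(f"{rows:>13,} rows -> {total_gb:8.3f} GB")

    # Output:
    #     1,000,000 rows ->    0.078 GB   (~78 MB: trivial for Hadoop)
    # 1,000,000,000 rows ->   78.000 GB   (now worth distributing)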


Closing the gender gap in tech

Stories from women who are making a big impact on the field of big data.

The gender gap in tech is not news, but here’s what is: it’s shrinking. In O’Reilly’s latest report — Women in Data: Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education — female data practitioners discuss their work, their achievements, and the attitudes that have propelled them forward to career success.

Through a series of 15 interviews with women across the data field, we’ve uncovered stories we think you’ll find both interesting and inspiring. The interviews explore:

  • Interviewees’ views about opportunities for women in the fields of science, technology, engineering, and math (STEM)
  • Benefits of the data field as a career choice for women
  • The changing attitudes of Millennials toward women working in data
  • Remedies for continuing to close the gender gap in tech

Our findings reveal an important consensus among the women we interviewed: the role of female mentors and role models working in STEM is extremely important for opening up the pathway for more women to enter these fields. In fact, the impact that mentors have had on our interviewees has inspired many of them to serve, in turn, as mentors to other female colleagues and to younger generations of girls. Read more…


Human-in-the-loop machine learning

Practical machine-learning applications and strategies from experts in active learning.

What do you call a practice that most data scientists have heard of, few have tried, and even fewer know how to do well? It turns out, no one is quite certain what to call it. In our latest free report Real-World Active Learning: Applications and Strategies for Human-in-the-Loop Machine Learning, we examine the relatively new field of “active learning” — also referred to as “human computation,” “human-machine hybrid systems,” and “human-in-the-loop machine learning.” Whatever you call it, the field is exploding with practical applications that are proving the efficiency of combining human and machine intelligence.

Learn from the experts

Through in-depth interviews with experts in the field of active learning and crowdsource management, industry analyst Ted Cuzzillo reveals top tips and strategies for using short-term human intervention to actively improve machine models. As you’ll discover, the point at which a machine model fails is precisely where there’s an opportunity to insert — and benefit from — human judgment.
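The report is tool-agnostic, but the core loop is easy to sketch. Below is a minimal, hypothetical uncertainty-sampling round in Python with scikit-learn: the examples the model is least confident about are exactly the ones routed to human judges. The ask_human function is a stand-in for whatever labeling workflow (crowd or expert) you use; it is not from the report.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def ask_human(example):
        # Hypothetical stand-in for a human labeling step (crowd task, expert review).
        return 0  # replace with the label your annotators return

    def active_learning_round(model, X_train, y_train, X_pool, budget=10):
        # Train on what we have, then score the unlabeled pool.
        model.fit(X_train, y_train)
        confidence = model.predict_proba(X_pool).max(axis=1)
        # Lowest-confidence examples are where the model "fails": send those
        # to humans rather than labeling at random.
        ask = np.argsort(confidence)[:budget]
        new_y = np.array([ask_human(X_pool[i]) for i in ask])
        X_train = np.vstack([X_train, X_pool[ask]])
        y_train = np.concatenate([y_train, new_y])
        return model.fit(X_train, y_train), X_train, y_train

    # Usage (shapes are illustrative):
    # model = LogisticRegression()
    # model, X_train, y_train = active_learning_round(model, X_train, y_train, X_pool)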

Find out:

  • When active learning works best
  • How to manage crowdsource contributors (including expert-level contributors)
  • Basic principles of labeling data
  • Best practice methods for assessing labels
  • When to skip the crowd and mine your own data

Explore real-world examples

This report gives you a behind-the-scenes look at how human-in-the-loop machine learning has helped improve the accuracy of Google Maps, match business listings at GoDaddy, rank top search results at Yahoo!, refer relevant job postings to people on LinkedIn, identify expert-level contributors using the Quizz recruitment method, and recommend women’s clothing based on customer and product data at Stitch Fix. Read more…


Network structure and dynamics in online social systems

Understanding information cascades, viral content, and significant relationships.


I rarely work with social network data, but I’m familiar with the standard problems confronting data scientists who work in this area. These include questions pertaining to network structure, viral content, and the dynamics of information cascades.

At last year’s Strata + Hadoop World NYC, Cornell Professor and Nevanlinna Prize Winner Jon Kleinberg walked the audience through a series of examples from social network analysis, looking at the content of shared photos and text, as well as the structures of the networks. It was a truly memorable presentation from one of the foremost experts in network analysis. Each of the problems he discussed would be of interest to marketing professionals, and the analytic techniques he described were accessible to many data scientists. What struck me is that while these topics are easy to describe, framing the right question requires quite a bit of experience with the underlying data.

Predicting whether an information cascade will double in size

Can you predict whether a piece of information (say, a photo) will be shared only a few times or hundreds (if not thousands) of times? Large cascades are very rare, which makes predicting their eventual size difficult. You either default to a pathological answer (after all, most pieces of information are shared only once), or you create a balanced data set (with equal numbers of small and large cascades) and end up solving an artificial task.

Thinking of a social network as an information transport layer, Kleinberg and his colleagues instead set out to track the evolution of cascades. In the process, they framed an interesting, balanced algorithmic prediction problem: given a cascade of size k, predict whether it will reach size 2k (it turns out that, conditioned on a cascade reaching size k, its median eventual size is roughly 2k). Read more…
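The features in the actual work are richer than anything reproduced here, but the shape of the task is easy to sketch in Python. Assume a hypothetical collection of cascades, each recording reshare times and resharer follower counts; features are computed from the first k reshares only, and the label is whether the cascade doubled:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    K = 5  # observe only the first k reshares of each cascade

    def early_features(times, follower_counts):
        # Hypothetical simplification of the temporal/structural features:
        # how fast the first k reshares arrived, and how big their audiences were.
        gaps = np.diff(times[: K + 1])
        early = follower_counts[:K]
        return [gaps.mean(), gaps.std(), np.mean(early), np.max(early)]

    # For every cascade that reached size k, predict whether it doubles.
    # Because 2k is roughly the median outcome, the classes are ~50/50 by construction:
    # X = np.array([early_features(c.times, c.followers) for c in cascades])
    # y = np.array([c.final_size >= 2 * K for c in cascades])
    # clf = GradientBoostingClassifier().fit(X, y)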


Beyond AI: artificial compassion

If what we are trying to build is artificial minds, intelligence might be the smaller, easier part.

When we talk about artificial intelligence, we often make an unexamined assumption: that intelligence, understood as rational thought, is the same thing as mind. We use metaphors like “the brain’s operating system” or “thinking machines,” without always noticing their implicit bias.

But if what we are trying to build is artificial minds, we need only look at a map of the brain to see that in the domain we’re tackling, intelligence might be the smaller, easier part.

Maybe that’s why we started with it.

After all, the rational part of our brain is a relatively recent add-on. Setting aside unconscious processes, most of our gray matter is devoted not to thinking, but to feeling.

There was a time when we deprecated this larger part of the mind, as something we should either ignore or, if it got unruly, control.

But now we understand that, as troublesome as they may sometimes be, emotions are essential to being fully conscious. For one thing, as neurologist Antonio Damasio has demonstrated, we need them in order to make decisions. A certain kind of brain damage leaves the intellect unharmed, but removes the emotions. People with this affliction tend to analyze options endlessly, never settling on a final choice. Read more…


Improving on the Lambda Architecture for streaming analysis

Using fast, scalable relational databases to build event-oriented applications.

Modern organizations have started pushing their big data initiatives beyond historical analysis. Fast data creates big data, and applications are being developed that capture value, specifically real-time analytics, the moment fast data arrives. The need to analyze streaming data as it arrives, for real-time analytics, alerting, customer engagement, or other on-the-spot decision-making, is converging on a layered software setup called the Lambda Architecture.

The Lambda Architecture, a collection of both big and fast data software components, is a software paradigm designed to capture value, specifically analytics, not only from historical data but also from data streaming into the system.

In this article, I’ll explain the challenges that this architecture currently presents and explore some of the weaknesses. I’ll also discuss an alternative architecture using an in-memory database that can simplify and extend the capabilities of Lambda. Read more…
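To make the layered setup concrete, here is the essential shape of a Lambda query path as a minimal Python sketch (hypothetical counters stand in for the batch and speed layers). The duplication it shows, one metric maintained in two systems and reconciled on every read, is precisely the complexity the article takes aim at:

    from collections import Counter

    batch_view = Counter()  # rebuilt periodically from all historical events
    speed_view = Counter()  # updated per event as the stream arrives

    def ingest(event_key):
        speed_view[event_key] += 1          # speed layer: fast, incremental

    def rebuild_batch(all_event_keys):
        # Batch layer: slow but authoritative; afterwards the speed layer
        # only needs to cover events newer than the rebuild.
        global batch_view
        batch_view = Counter(all_event_keys)
        speed_view.clear()

    def query(event_key):
        # Serving layer: every read reconciles two systems' answers.
        return batch_view[event_key] + speed_view[event_key]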


Getting started with data science in the cloud

Learn how to manipulate data, and construct and evaluate models in Azure ML, using a complete data science example.

Large-scale machine learning, or predictive analytics, is having a powerful impact across many industries. By using machine learning, companies, governments, and not-for-profits are replacing guesses and seat-of-the-pants estimates with valuable data-driven predictions.

Deriving value from machine learning, however, is often impeded by complex technology deployments and long model-development cycles. Fortunately, machine learning and data science are undergoing democratization. Workflow environments make tools for building and evaluating sophisticated machine learning models accessible to a wider range of users. Cloud-based environments provide secure ubiquitous access to data storage and powerful data science tools.

To get you started creating and evaluating your own machine learning models, O’Reilly has commissioned a new report: “Data Science in the Cloud, with Azure Machine Learning and R.” We use an in-depth data science example — predicting bicycle rental demand — to show you how to perform basic data science tasks, including data management, data transformation, machine learning, and model evaluation in the Microsoft Azure Machine Learning cloud environment. Using a free-tier Azure ML account, the example R scripts, and the data provided, you can get hands-on experience with this practical data science example. Read more…
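The report itself works in R inside Azure ML; purely to give a feel for the modeling task, here is a minimal scikit-learn sketch of the same kind of demand regression on synthetic stand-in data (the hour, temperature, and working-day features are assumptions for illustration, not the report's actual feature set):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Hypothetical stand-in for the bike-rental data: hour of day, temperature,
    # and a working-day flag drive the number of rentals.
    hour = rng.integers(0, 24, 2000)
    temp = rng.uniform(0, 35, 2000)
    workday = rng.integers(0, 2, 2000)
    demand = 50 + 8 * np.sin(hour / 24 * 2 * np.pi) * workday \
             + 2 * temp + rng.normal(0, 10, 2000)

    X = np.column_stack([hour, temp, workday])
    X_tr, X_te, y_tr, y_te = train_test_split(X, demand, random_state=0)

    # Fit on the training split, then evaluate held-out error.
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    print("MAE:", mean_absolute_error(y_te, model.predict(X_te)))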


A human-centered approach to data-driven design

The O'Reilly Radar Podcast: Arianna McClain on humanizing data-driven design, and Dirk Knemeyer on design in emerging tech.

This week on the O’Reilly Radar Podcast, O’Reilly’s Roger Magoulas talks with Arianna McClain, a senior hybrid design researcher at IDEO, about storytelling through data; the interdependent nature of qualitative and quantitative data; and the human-centered, data-driven design approach at IDEO.


During the interview, Magoulas noted that in our research at O’Reilly, we’ve been talking a lot about the importance of the social science design element in getting the most out of data. McClain emphasized the importance of storytelling through data at IDEO and described IDEO’s human-centered approach to data-driven design:

“IDEO really believes in staying and remaining human-centered throughout the data journey. Starting off with, how might we measure something, how might we measure a behavior. We don’t sit in a room and come up with an algorithm or come up with a question. We start by talking to people. … We’re trying to build measures and survey questions to understand at scale how people make decisions. … IDEO remains data-driven to how we analyze and synthesize our findings. When we’re given a large data set, we don’t analyze it and write a report and give it to people and say, ‘This is the direction we think you should go.’

“Instead, we look at segmentations in the data, and stories in the data, and how the data clusters. Then we go back, and we try to find people who are representative of that cluster or that segmentation. The segmentations, again, are not based on demographic variables. They are based on needs and insights that we heard in our qualitative research. … What we’ve recognized is that something that seems so clear in the analysis is often very nuanced, and it can inform our design.”

Read more…


The evolution of GraphLab

The O'Reilly Data Show Podcast: Carlos Guestrin on the early days of GraphLab and the evolution of GraphLab Create.

Editor’s note: Carlos Guestrin will be part of the team teaching Large-scale Machine Learning Day at Strata + Hadoop World in San Jose. Visit the Strata + Hadoop World website for more information on the program.

I only really started playing around with GraphLab when the companion project GraphChi came onto the scene. By then, I’d heard from many avid users and admired how their user conference instantly became a popular San Francisco Bay Area data science event. For this podcast episode, I sat down with Carlos Guestrin, co-founder/CEO of Dato, a start-up launched by the creators of GraphLab. We talked about the early days of GraphLab, the evolution of GraphLab Create, and what he’s learned from starting a company.

MATLAB for graphs

Guestrin remains a professor of computer science at the University of Washington, and GraphLab originated when he was still a faculty member at Carnegie Mellon. GraphLab was built by avid MATLAB users who needed to do large-scale graph computations to demonstrate their research results. Guestrin shared some of the backstory:

“I was a professor at Carnegie Mellon for about eight years before I moved to Seattle. A couple of my students, Joey Gonzales and Yucheng Low, were working on large-scale distributed machine learning algorithms, especially with things called graphical models. We tried to implement them to show off the theorems that we had proven. We tried to run those things on top of Hadoop, and it was really slow. We ended up writing those algorithms on top of MPI, which is a high-performance computing library, and it was just a pain. It took a long time, it was hard to reproduce the results, and the impact it had on us was that writing papers became a pain. We wanted a system for my lab that allowed us to write more papers more quickly. That was the goal. In other words, we wanted to implement these machine learning algorithms more easily and more quickly, specifically on graph data, which is what we focused on.”

Read more…


It’s not just about Hadoop core anymore

For maximum business value, big data applications have to involve multiple Hadoop ecosystem components.

Data is deluging today’s enterprise organizations from ever-expanding sources and in ever-expanding formats. To gain insight from this valuable resource, organizations have been adopting Apache Hadoop with increasing momentum. Now, the most successful players in enterprise big data are no longer utilizing only Hadoop “core” (i.e., batch processing with MapReduce), but are moving toward analyzing and solving real-world problems using the broader set of tools in an enterprise data hub (often interactively), including components such as Impala, Apache Spark, Apache Kafka, and Search. With this new focus on workload diversity comes an increased demand for developers who are well-versed in using a variety of components across the Hadoop ecosystem.

Due to the size and variety of the data we’re dealing with today, a single use case or tool — no matter how robust — can camouflage the full, game-changing potential of Hadoop in the enterprise. Rather, developing end-to-end applications that incorporate multiple tools from the Hadoop ecosystem, not just the Hadoop core, is the first step toward activating the disparate use cases and analytic capabilities of which an enterprise data hub is capable. Whereas MapReduce code primarily leverages Java skills, developers who want to work on full-scale big data engineering projects need to be able to work with multiple tools, often simultaneously. An authentic big data applications developer can ingest and transform data using Kite SDK, write SQL queries with Impala and Hive, and create an application GUI with Hue. Read more…
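As one small taste of working across components, an application can issue Impala queries from Python with the impyla client (a rough sketch: the host and table below are hypothetical, and 21050 is the usual impalad HiveServer2 port; adjust both to your cluster):

    from impala.dbapi import connect  # impyla: Python DB-API client for Impala

    # Hypothetical host and table; Impala executes the SQL interactively
    # against data already in the cluster.
    conn = connect(host="impalad.example.com", port=21050)
    cur = conn.cursor()
    cur.execute("""
        SELECT occupation, COUNT(*) AS n, AVG(salary) AS avg_salary
        FROM people
        GROUP BY occupation
        ORDER BY n DESC
        LIMIT 10
    """)
    for occupation, n, avg_salary in cur.fetchall():
        print(f"{occupation}: {n} rows, avg salary {avg_salary:.0f}")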
