ENTRIES TAGGED "data scientists"

Data Science for Business

What business leaders need to know about data and data analysis to drive their businesses forward.

A couple of years ago, Claudia Perlich introduced me to Foster Provost, her PhD adviser. Foster showed me the book he was writing with Tom Fawcett and using in his teaching at NYU.

Foster and Tom have a long history of applying data to practical business problems. Their book, which evolved into Data Science for Business, was different from all the other data science books I’ve seen. It wasn’t about tools: Hadoop and R are scarcely mentioned, if at all. It wasn’t about coding: business students don’t need to learn how to implement machine learning algorithms in Python. It is about business: specifically, it’s about the data analytic thinking that business people need to work with data effectively.

Data analytic thinking means knowing what questions to ask, how to ask those questions, and whether the answers you get make sense. Business leaders don’t (and shouldn’t) do the data analysis themselves. But in this data-driven age, it’s critically important for business leaders to understand how to work with the data scientists on their teams. In today’s business world, it’s essential to understand which algorithms are used for different applications, how statistics are used to create models of human and economic behavior, overfitting and its symptoms, and much more. You might not need to know how to implement a machine learning algorithm, but you do need to understand the ideas the data scientists on your team are using.

The goal of data science is putting data to work. That’s what Data Science for Business is all about, and the reason I’m excited to see us publishing it. There are many books about data science, and an increasing number of undergraduate and graduate programs in data science. But I haven’t seen anything that teaches data science for the leaders who will be using data to drive their businesses forward.

Data Science for Social Good: A Fellowship

Training Aspiring Data Scientists in Chicago

By Juan-Pablo Velez

The Fellowship

As technology penetrates further into everyday life, we’re creating lots of data. Businesses are scrambling to find data scientists to make sense of all this data and turn it into better decisions.

Businesses aren’t alone. Data science could transform how governments and nonprofits tackle society’s problems. The problem is, most governments and nonprofits simply don’t know what’s possible yet. There are too few data scientists out there and too many spending their days optimizing ads instead of bettering lives. To make real impact with data, we need to work on high-impact projects that show these organizations the power of analytics. And we need to expose data scientists to the problems that really matter.

That’s exactly why we’re doing the Eric and Wendy Schmidt Data Science for Social Good summer fellowship at the University of Chicago. The program is led by Rayid Ghani, former chief data scientist for the 2012 Obama campaign, and is funded by Google Chairman Eric Schmidt.

We’ve brought three dozen aspiring data scientists from all over the world to Chicago to spend a summer working on data science projects with social impact. The fellows are working closely with governments and nonprofits (including the City of Chicago, the Chicago Transit Authority, and the Nurse-Family Partnership) to take on real-world problems in education, health, energy, transportation, and more. (To read up on our projects, check out dssg.io/projects; to get involved, go to github.com/dssg.)

Lots of folks have been asking about how we’re training data scientists.

Data scientists are a hybrid group with computer science, statistics, machine learning, data mining, and database skills. These skills take years to learn and there’s no way to teach all of them during a few weeks. Instead of starting from scratch, we decided to start with students in computational and quantitative fields – folks that already have some of these skills and use them daily in an academic setting. And we gave them the opportunity to apply their abilities to solve real-world problems and to pick up the skills they’re missing.

Read more…

Leading Indicators

In a conversation with Q Ethan McCallum (who should be credited as co-author), we wondered how to evaluate data science groups. If you’re looking at an organization’s data science group from the outside, possibly as a potential employee, what can you use to evaluate it? It’s not a simple problem under the best of conditions: you’re not an insider, so you don’t know the full story of how many projects it has tried, whether they have succeeded or failed, relations between the data group, management, and other departments, and all the other stuff you’d like to know but will never be told.

Our starting point was remote: Q told me about Tyler Brulé’s travel writing for the Financial Times (behind a paywall, unfortunately), in which he says that a club sandwich is a good proxy for hotel quality: you go into the restaurant and order a club sandwich. A club sandwich isn’t hard to make: there’s no secret recipe or technique that’s going to make Hotel A’s sandwich significantly better than B’s. But it’s easy to cut corners on ingredients and preparation. And if a hotel is cutting corners on its club sandwiches, it’s probably cutting corners in other places.

Read more…

A different take on data skepticism

Our tools should make common cases easy and safe, but that's not the reality today.

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …

There is so much value to be gained if we can put the power of learning, inference, and prediction methods into the hands of more developers and domain experts. But how can we avoid the pitfalls that Cathy and Mike are rightly concerned about? If a seemingly simple method like k-nearest neighbors classification is dangerous in unskilled hands (and it certainly is), then what hope is there? Well, I would argue that all ML methods are not created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well? Read more…
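
To make the k-means point concrete, here is a minimal sketch in Python. It is not taken from the post: the helper `kmeans_1d`, its deterministic starting centroids, and the evenly spaced toy data are all assumptions chosen for illustration. It runs plain Lloyd's iteration on values with no group structure and still hands back exactly as many clusters as it was asked for.

```python
def kmeans_1d(points, k, iterations=50):
    """Plain Lloyd's algorithm on 1-D data; returns (centroids, labels).

    Hypothetical helper written for this illustration only.
    """
    ordered = sorted(points)
    # Deterministic initialisation (an assumption): spread the starting
    # centroids evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Evenly spaced values: there is no real group structure here.
undifferentiated = [i / 10 for i in range(100)]

for k in (4, 7):
    centroids, labels = kmeans_1d(undifferentiated, k)
    sizes = [labels.count(c) for c in range(k)]
    print(k, "clusters requested, sizes returned:", sizes)
```

Asking for four clusters yields four; asking for seven yields seven. Nothing in the output signals that neither split reflects real structure in the data.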

Data’s missing ingredient? Rhetoric.

Arguments are the glue that connects data to decisions

Data is key to decision making. Yet we are rarely faced with a situation where things can be put into such a clear logical form that we have no choice but to accept the force of evidence before us. In practice, we should always be weighing alternatives, looking for missed possibilities, and considering what else we need to figure out before we can proceed.

Arguments are the glue that connects data to decisions. And if we want good decisions to prevail, both as decision makers and as data scientists, we need to better understand how arguments function. We need to understand the best ways that arguments and data interact. The statistical tools we learn in classrooms are not sufficient alone to deal with the messiness of practical decision-making.

Examples of this fill the headlines. You can see evidence of rigid decision-making in how the American medical establishment decides what constitutes a valid study result. By custom and regulation, there is an official statistical breaking point for all studies. Below this point, a result will be acted upon. Above, it won’t be. Cut and dried, but dangerously brittle.

Read more…

How do you become a data scientist? Well, it depends

My obsession with data and user needs is now focused on the many paths toward data science.

Thanksgiving 2012

Over Thanksgiving, Richie and Violet asked me if I preferred the iPhone or the Galaxy SIII. I have both. It is a long story. My response was, “It depends.” Richie, who would probably bleed Apple if you cut him, was very unsatisfied with my answer. Violet was more diplomatic. Yet, it does depend. It depends on what the user wants to use the device for.

I say, “It depends” a lot in my life.

Both in the personal life and the work life … well, because it really is all one life, isn’t it? With my work over the past decade or so, I have been obsessive about being user-focused. I spend a lot of time thinking about whom a product, feature, or service is for and how they will use it. Not how I want them to use it, but how they want to use it and what problem they are trying to solve with it.

Before I joined O’Reilly, I was obsessively focused on the audience for my data analysis. “C” level execs look for different kinds of insights than a director of engineering. A field sales rep looks for different insights than a software developer. Understanding more about who the user or audience was for a data project enabled me to map the insights to the user’s role, their priorities, and how they wanted to use the data. Because, you know what isn’t too great? When you spend a significant amount of time working on something that does not get used or is not what someone needed to help them in their job.
Read more…

Data science in the natural sciences

Big data is shaping diverse fields, showing that past predictions from data-driven natural sciences are now coming to pass.

I find myself having conversations recently with people from increasingly diverse fields, both at Columbia and in local startups, about how their work is becoming “data-informed” or “data-driven,” and about the challenges posed by applied computational statistics or big data.

A view from health and biology in the 1990s

In discussions with, for example, New York City journalists, physicists, or even former students now working in advertising or social media analytics, I’ve been struck by how many of the technical challenges and lessons learned are reminiscent of those faced in the health and biology communities over the last 15 years, when these fields experienced their own data-driven revolutions and wrestled with many of the problems now faced by people in other fields of research or industry.

It was around then, as I was working on my PhD thesis, that sequencing technologies became sufficient to reveal the entire genomes of simple organisms and, not long thereafter, the first draft of the human genome. This advance in sequencing technologies made possible the “high throughput” quantification of, for example,

  • the dynamic activity of all the genes in an organism; or
  • the set of all protein-protein interactions in an organism; or even
  • statistical comparative genomics revealing how small differences in genotype correlate with disease or other phenotypes.

These advances required formation of multidisciplinary collaborations, multi-departmental initiatives, advances in technologies for dealing with massive datasets, and advances in statistical and mathematical methods for making sense of copious natural data. Read more…

Data Jujitsu: The art of turning data into product

Smart data scientists can make big problems small.

Having worked in academia, government and industry, I’ve had a unique opportunity to build products in each sector. Much of this product development has been around building data products. Just as methods for general product development have steadily improved, so have the ideas for developing data products. Thanks to large investments in the general area of data science, many major innovations (e.g., Hadoop, Voldemort, Cassandra, HBase, Pig, Hive, etc.) have made data products easier to build. Nonetheless, data products are unique in that they are often extremely difficult, and seemingly intractable for small teams with limited funds. Yet, they get solved every day.

How? Are the people who solve them superhuman data scientists who can come up with better ideas in five minutes than most people can in a lifetime? Are they magicians of applied math who can cobble together millions of lines of code for high-performance machine learning in a few hours? No. Many of them are incredibly smart, but meeting big problems head-on usually isn’t the winning approach. There’s a method to solving data problems that avoids the big, heavyweight solution and instead concentrates on building something quickly and iterating. Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small.

We call this Data Jujitsu: the art of using multiple data elements in clever ways to solve iterative problems that, when combined, solve a data problem that might otherwise be intractable. It’s related to Wikipedia’s definition of the ancient martial art of jujitsu: “the art or technique of manipulating the opponent’s force against himself rather than confronting it with one’s own force.”

How do we apply this idea to data? What is a data problem’s “weight,” and how do we use that weight against itself? These are the questions that we’ll work through in the subsequent sections.

Read more…

Now available: "Planning for Big Data"

A free handbook for anybody wanting to understand and use big data.

"Planning for Big Data" is a new book that helps you understand what big data is, why it matters, and where to get started.
