"data skepticism" entries

Understanding skepticism

Skepticism isn't a blanket rejection of data; it's central to understanding data.

I’d like to correct the impression, given by Derrick Harris on GigaOm, that I’m part of a backlash against “big data.”

I’m not skeptical about data or the power of data, but you don’t have to look very far or very hard to see data abused. The people best positioned to be skeptical about data, and to point out its abuse, are data scientists, because they understand problems such as overfitting, bias, and much more.

Cathy O’Neil recently wrote about a Congressional hearing in which a teacher at a new data science program dodged some perceptive questions about whether he was teaching students to be skeptical about results: whether he was teaching them how to test that their observations were real signal rather than just noise. Anyone who has worked with data knows that false correlations come cheaply, particularly when you’re working with a lot of data. But ducking that question is not the attitude we need.
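To make that concrete, here’s a minimal sketch (plain NumPy; the sample and feature counts are invented for illustration) of just how cheaply false correlations come: screen enough random features against a random target, and an apparently strong correlation shows up by chance alone.

```python
import numpy as np

rng = np.random.default_rng(42)

# 100 observations of 10,000 candidate features (all pure noise),
# plus a target that is also pure noise.
n_samples, n_features = 100, 10_000
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

# Pearson correlation of every feature with the target, vectorized.
corrs = (X - X.mean(axis=0)).T @ (y - y.mean()) / (
    n_samples * X.std(axis=0) * y.std()
)

best = np.abs(corrs).argmax()
print(f"strongest 'signal': feature {best}, r = {corrs[best]:+.2f}")
# With 10,000 tries on 100 data points, |r| around 0.4 turns up by
# chance alone: strong-looking, and completely meaningless.
```

That is the sense in which false correlations are cheap: the more places you look, the more you find.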

Data is valuable. I see no end to the collection or analysis of data, nor should there be an end. But like any tool, data has to be used with care. Skepticism isn’t a blanket rejection of data; it’s central to understanding data. That’s precisely what makes “science” science.

And of all people, journalists should understand what skepticism means, even if they don’t have the technical tools to practice it.

A different take on data skepticism

Our tools should make common cases easy and safe, but that's not the reality today.

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.
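One way to see the danger Cathy is pointing at (a sketch on invented synthetic data, not her example): k-nearest neighbors silently assumes that all features live on comparable scales, so a single irrelevant feature measured in large units can dominate the distance metric and bury the real signal.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# The class label depends only on feature 0 (range ~0-1); feature 1 is
# irrelevant noise on a much larger scale (~0-1000).
n = 400
signal = rng.uniform(0, 1, n)
noise = rng.uniform(0, 1000, n)
X = np.column_stack([signal, noise])
y = (signal > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Naive k-NN: Euclidean distance is dominated by the noisy feature.
naive = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
print("unscaled accuracy:", naive.score(X_te, y_te))  # near coin-flip

# The same model after standardizing the features recovers the signal.
scaler = StandardScaler().fit(X_tr)
scaled = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_tr), y_tr)
print("scaled accuracy:  ", scaled.score(scaler.transform(X_te), y_te))
```

Nothing in the unscaled run errors out or warns; it simply returns confident, wrong answers, which is precisely the failure mode that worries the pros.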

Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …

There is so much value to be gained if we can put the power of learning, inference, and prediction methods into the hands of more developers and domain experts. But how can we avoid the pitfalls that Cathy and Mike are rightly concerned about? If a seemingly simple method like k-nearest neighbors classification is dangerous in unskilled hands (and it certainly is), then what hope is there? Well, I would argue that not all ML methods are created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?
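A minimal sketch of that last failure mode (uniform random data, assumed purely for illustration): k-means will dutifully carve structureless data into however many buckets we request, and only an external diagnostic such as the silhouette score even hints that the resulting partitions are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Truly undifferentiated data: 500 points uniform on the unit square,
# with no cluster structure whatsoever.
X = rng.uniform(size=(500, 2))

# k-means never objects; it returns a confident partition for any k.
for k in (2, 4, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.2f}")
# The scores are middling and nearly identical across k: nothing singles
# out a "right" number of clusters, because there isn't one.
```

The method answers the question it was asked (“split this into k groups”) without ever raising the question that matters (“should this be split at all?”).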