A different take on data skepticism

Our tools should make common cases easy and safe, but that's not the reality today.

Recently, the Mathbabe (aka Cathy O’Neil) vented some frustration about the pitfalls in applying even simple machine learning (ML) methods like k-nearest neighbors. As data science is democratized, she worries that naive practitioners will shoot themselves in the foot because these tools can offer very misleading results. Maybe data science is best left to the pros? Mike Loukides picked up this thread, calling for healthy skepticism in our approach to data and implicitly cautioning against a “cargo cult” approach in which data collection and analysis methods are blindly copied from previous efforts without sufficient attempts to understand their potential biases and shortcomings.

Well, arguing against greater understanding of the methods we apply is like arguing against motherhood and apple pie, and Cathy and Mike are spot on in their diagnoses of the current situation. And yet …

There is so much value to be gained if we can put the power of learning, inference, and prediction methods into the hands of more developers and domain experts. But how can we avoid the pitfalls that Cathy and Mike are rightly concerned about? If a seemingly simple method like k-nearest neighbors classification is dangerous in unskilled hands (and it certainly is), then what hope is there? Well, I would argue that not all ML methods are created equal with regard to their safety. In fact, it is exactly some of the simplest (and most widely used) methods that are the most dangerous.
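
To make the k-NN hazard concrete, here is a minimal sketch (in Python with scikit-learn, on synthetic data — none of which is specific to the argument above). Euclidean distance silently assumes all features share a scale; when one column is measured in much larger units, the "nearest" neighbors become nearly meaningless:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: column 0 carries the class signal at unit scale,
# column 1 is pure noise measured in units 1000x larger.
n = 400
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y + rng.normal(scale=0.5, size=n),   # informative, O(1)
    rng.normal(scale=1000.0, size=n),    # uninformative, O(1000)
])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)

# Raw features: Euclidean distance is dominated by the noise column,
# so the "nearest" neighbors are effectively random.
acc_raw = knn.fit(X_tr, y_tr).score(X_te, y_te)

# Standardized features: both columns get an equal say, and the
# informative column can actually drive the vote.
scaler = StandardScaler().fit(X_tr)
acc_scaled = knn.fit(scaler.transform(X_tr), y_tr).score(
    scaler.transform(X_te), y_te)

print(f"accuracy, raw features:    {acc_raw:.2f}")     # typically near chance
print(f"accuracy, scaled features: {acc_scaled:.2f}")  # typically far better
```

Nothing in the raw run warns you that anything is wrong; the classifier cheerfully returns predictions either way.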

Why? Because these methods have lots of hidden assumptions. Well, maybe the assumptions aren’t so much hidden as nodded-at-but-rarely-questioned. A good analogy might be jumping to the sentencing phase of a criminal trial without first assessing guilt: asking “What is the punishment that best fits this crime?” before asking “Did the defendant actually commit a crime? And if so, which one?” As another example of a simple-yet-dangerous method, k-means clustering assumes a value for k, the number of clusters, even though there may not be a “good” way to divide the data into this many buckets. Maybe seven buckets provides a much more natural explanation than four. Or maybe the data, as observed, is truly undifferentiated and any effort to split it up will result in arbitrary and misleading distinctions. Shouldn’t our methods ask these more fundamental questions as well?
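
Here is a quick illustration of that last point, again a sketch in Python with scikit-learn on invented data: k-means will happily carve structureless data into exactly as many clusters as you ask for, and only an extra diagnostic (here, the silhouette score) hints that the partition is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Truly undifferentiated data: points drawn uniformly from a square.
X_noise = rng.uniform(size=(500, 2))

# Genuinely clustered data, for comparison.
X_blobs, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.5,
                        random_state=0)

for name, X in [("uniform noise", X_noise), ("real clusters", X_blobs)]:
    # k-means never complains: ask for 4 clusters, get 4 clusters,
    # complete with tidy centers, whether or not structure exists.
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    # Scores near 1 indicate well-separated clusters; a middling score
    # suggests the split is largely arbitrary.
    print(f"{name}: silhouette = {silhouette_score(X, labels):.2f}")
```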

So, which methods are better in this regard? In general, it’s those that explore model space in addition to model parameters. In the case of k-means, for example, this would mean learning the number k in addition to the cluster assignment for each data point. For k-nearest neighbors, we could learn the number of exemplars to use and also the distance metric that provides the best explanation for the data. This multi-level approach might sound advanced, and it is true that these implementations are more complex. But complexity of implementation needn’t correlate with “danger” (thanks in part to software engineering), and it’s certainly not a sufficient reason to dismiss more robust methods.
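
As a sketch of what exploring model space can look like in practice (using scikit-learn's Gaussian mixtures and BIC as a simple stand-in; a fully Bayesian treatment, such as a Dirichlet process mixture, would go further), we can score a range of values of k and let the data vote:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Data with 7 real clusters, which a fixed "k = 4" would never reveal.
X, _ = make_blobs(n_samples=700, centers=7, cluster_std=0.6, random_state=0)

# Fit one mixture model per candidate k and record its BIC, which
# rewards goodness of fit but penalizes extra components.
candidates = range(1, 11)
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in candidates]

best_k = candidates[int(np.argmin(bics))]
print(f"BIC prefers k = {best_k}")  # typically recovers 7 here
```

A dozen lines longer than plain k-means, but the method now asks the more fundamental question instead of assuming the answer.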

I find the database analogy useful here: developers with only a foggy notion of database implementation routinely benefit from the expertise of the programmers who do understand these systems — i.e., the “professionals.” How? Well, decades of experience — and lots of trial and error — have yielded good abstractions in this area. As a result, we can meaningfully talk about the database “layer” in our overall “stack.” Of course, these abstractions are leaky, like all others, and there are plenty of sharp edges remaining (and, some might argue, more being created every day with the explosion of NoSQL solutions). Nevertheless, my weekend-project webapp can store and query insane amounts of data — and I have no idea how to implement a B-tree.

For ML to have a similarly broad impact, I think the tools need to follow a similar path. We need to push ourselves away from the viewpoint that sees ML methods as a bag of tricks, with the right method chosen on a per-problem basis, success requiring a good deal of art, and evaluation mainly by artificial measures of accuracy at the expense of other considerations. Trustworthiness, robustness, and conservatism are just as important, and will have far more influence on the long-run impact of ML.

Will well-intentioned people still be able to lie to themselves? Sure, of course! To say nothing of the greedy or malicious actors that Cathy and Mike are also concerned about. But our tools should make the common cases easy and safe, and that's not the reality today.
