Data skepticism

If data scientists aren't skeptical about how they use and analyze data, who will be?

A couple of months ago, I wrote that “big data” is heading into the trough of the hype cycle, dragged down by oversized promises. That’s certainly happening: I see more expressions of skepticism about the value of data every day. Some of that skepticism is a reaction against the hype; a lot of it arises from ignorance, and it has the same smell as the rich history of science denial from the tobacco industry (and probably much earlier) onward.

But there’s another thread of data skepticism that’s profoundly important. On her MathBabe blog, Cathy O’Neil has written several articles about lying with data — about intentionally developing models that don’t work because it’s possible to make more money from a bad model than a good one. (If you remember Mel Brooks’ classic “The Producers,” it’s the same idea.) In a slightly different vein, Cathy argues that making machine learning simple for non-experts might not be in our best interests; it’s easy to start believing answers because the computer told you so, without understanding why those answers might not correspond with reality.

I had a similar conversation with David Reiley, an economist at Google, who is working on experimental design in social sciences. Heavily paraphrasing our conversation, he said that it was all too easy to think you have plenty of data, when in fact you have the wrong data, data that’s filled with biases that lead to misleading conclusions. As Reiley points out (pdf), “the population of people who sees a particular ad may be very different from the population who does not see an ad”; yet, many data-driven studies of advertising effectiveness don’t take this bias into account. The idea that there are limitations to data, even very big data, doesn’t contradict Google’s mantra that more data is better than smarter algorithms; it does mean that even when you have unlimited data, you have to be very careful about the conclusions you draw from that data. It is in conflict with the all-too-common idea that, if you have lots and lots of data, correlation is as good as causation.
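Reiley’s point about exposure bias can be made concrete with a toy simulation (all numbers here are hypothetical, invented purely for illustration): if ad targeting favors users who were already inclined to buy, a naive comparison of purchase rates between exposed and unexposed users wildly overstates the ad’s true causal effect.

```python
import random

random.seed(0)

def simulate(n=100_000):
    """Hypothetical population: targeting reaches inclined buyers more often."""
    exposed_buys = unexposed_buys = 0
    exposed_n = unexposed_n = 0
    for _ in range(n):
        inclined = random.random() < 0.2                        # 20% already inclined to buy
        sees_ad = random.random() < (0.8 if inclined else 0.2)  # biased targeting
        base = 0.30 if inclined else 0.02                       # purchase rate with no ad
        lift = 0.01 if sees_ad else 0.0                         # the ad's true causal effect: +1 point
        buys = random.random() < base + lift
        if sees_ad:
            exposed_n += 1
            exposed_buys += buys
        else:
            unexposed_n += 1
            unexposed_buys += buys
    return exposed_buys / exposed_n, unexposed_buys / unexposed_n

exposed_rate, unexposed_rate = simulate()
naive_lift = exposed_rate - unexposed_rate
# The naive "lift" comes out at roughly 13 percentage points, even though the
# ad's true effect was set to 1 point: the gap is selection bias, not advertising.
print(f"naive lift: {naive_lift:.3f}, true causal effect: 0.010")
```

A randomized holdout (deciding exposure by coin flip, independent of inclination) would recover the 1-point effect; that is the experimental-design problem Reiley is working on.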

Skepticism about data is normal, and it’s a good thing. If I had to give a one-line definition of science, it might be something like “organized and methodical skepticism based on evidence.” So, if we really want to do data science, it has to be done by incorporating skepticism. And here’s the key: data scientists have to own that skepticism. Data scientists have to be the biggest skeptics. Data scientists have to be skeptical about models, they have to be skeptical about overfitting, and they have to be skeptical about whether we’re asking the right questions. They have to be skeptical about how data is collected, whether that data is unbiased, and whether that data — even if there’s an inconceivably large amount of it — is sufficient to give you a meaningful result.
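The overfitting worry is easy to demonstrate with a minimal, hypothetical sketch: a “model” that simply memorizes its training data (here, 1-nearest-neighbor) looks flawless on the data it has seen and much worse on fresh data drawn from the same noisy process. All names and numbers below are invented for illustration.

```python
import random

random.seed(1)

def noisy_sample(n):
    """Hypothetical data: a weak linear signal buried in heavy noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [0.5 * x + random.gauss(0, 2) for x in xs]
    return xs, ys

train_x, train_y = noisy_sample(30)
test_x, test_y = noisy_sample(30)

def knn1(x):
    # Memorization: predict the y of the single closest training point.
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def mse(xs, ys, model):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training error is exactly zero (every point's nearest neighbor is itself),
# but error on held-out data is large: the model has fit the noise.
print("1-NN train error:", mse(train_x, train_y, knn1))
print("1-NN test error: ", mse(test_x, test_y, knn1))
```

The zero training error is exactly the seductive “the computer told you so” result O’Neil warns about; only a held-out evaluation exposes it.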

Because the bottom line is: if we’re not skeptical about how we use and analyze data, who will be? That’s not a pretty thought.


  • wilbur2

    Shrewd hedge fund analysts will be skeptical, motivated by the profits of seeing through disinformation.

  • Alex Tolley

    Without skepticism, data “scientists” will be charlatans.

  • Ken Williams

    The way I usually think about this is that I have to be my own harshest critic, before someone else gets the chance. If I’m lucky that could be someone in a talk I give about my methods and conclusions, but if not, it could be the actual failure of my methods in the field when the company deploys them & expects them to work. If I’ve overfit or otherwise made the analysis too rosy, I’ll be in big trouble later.

  • Tech_fiend

    Completely agree that drawing the right conclusions from your data requires a great deal of skepticism and the evaluation of alternative hypotheses. In my blog, I discuss how this is a required step in most serious disciplines where the stakes are high (science, medicine, crime investigation, etc.). This is a role the analyst has to play; otherwise all you do is use data to confirm your cognitive biases.

    It is true that the cost of a misdirected ad is pretty low, unlike say a misdirected drone strike. So, for all the talk of “Data Science” you find that scientific rigor missing in the digital marketing or search area where Big Data seems to be an end in itself and there’s little incentive to go beyond correlation. This is a luxury afforded only to the few whose entire business model is interacting with customers online.

    Reality will catch up (eventually). When Six Sigma-based statistical techniques first came out, they were heralded as the next big thing in business and a silver bullet for real, measurable results. However, it has been slow going outside of manufacturing. Similarly, I think the hype around Big Data will cool as folks realize that these are useful tools, but not applicable to every business model.

    What we definitely need are skeptics who don’t latch on to the latest fad, but can evaluate the pros/cons of the new tools and techniques and see them for what they are: a way to solve a problem or answer a question in a given business context.

  • Albert Ross

    Sadly, Big Data is just another oversold IT fad. We’ve had them before. Going way, way back we had mini-computers. We had ISAM files. We had PCs. We had databases. We had ‘structured methodology’. We had networks and the really big network, the Internet. We had .com. We had ERP. We had ‘rapid development’. We had ‘the cloud’. The latest manifestation of IT product is ‘Big Data’.

    Yes, it’s useful but NO – it’s not the answer to everyone’s prayers. Just as all the other things that IT comes up with aren’t. They are all contributors and some have possibly moved us on. But, at the end of the day, Big Data is just another lot of oversold hype. I know that those in the industry won’t agree. But then, they would be making money out of it, so they wouldn’t.

    Give it a year or two – and there will be something else, just as ‘exciting’.