The backlash against big data, continued

Ignore the hype. Learn to be a data skeptic.

Yawn. Yet another article trashing “big data,” this time an op-ed in the Times. This one is better than most, and ends with the truism that data isn’t a silver bullet. It certainly isn’t.

I’ll spare you all the links (most of which are much less insightful than the Times piece), but the backlash against “big data” is clearly in full swing. I wrote about this more than a year ago, in my piece on data skepticism: data is heading into the trough of a hype curve, driven by overly aggressive marketing, promises that can’t be kept, and spurious claims that, if you have enough data, correlation is as good as causation. It isn’t; it never was; it never will be. The paradox of data is that the more data you have, the more spurious correlations will show up. Good data scientists understand that. Poor ones don’t.
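
If you want to see that paradox for yourself, here is a minimal sketch. It reads "more data" as "more variables measured on the same observations," which is an assumption on my part, and the names (`pearson`, `count_spurious`), the sample size, and the 0.4 threshold are arbitrary choices for illustration. Every column is independent random noise, so any pair that clears the threshold is spurious by construction.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(num_vars, num_obs=30, threshold=0.4, seed=0):
    """Count pairs of independent noise columns whose |r| exceeds threshold.

    The columns are generated independently, so every pair counted here
    is a correlation with no causal story behind it.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(num_obs)] for _ in range(num_vars)
    ]
    return sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) > threshold
    )


if __name__ == "__main__":
    # The number of pairs grows roughly with the square of the number of
    # variables, and the count of "interesting" correlations grows with it.
    for num_vars in (10, 50, 200):
        pairs = num_vars * (num_vars - 1) // 2
        print(num_vars, "variables,", pairs, "pairs,",
              count_spurious(num_vars), "above threshold")
```

The fraction of pairs that clear the threshold stays roughly constant; it is the number of pairs that explodes. A skeptic asks how many comparisons were made before getting excited about any one of them.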

It’s very easy to say that “big data is dead” while you’re using Google Maps to navigate downtown Boston. It’s easy to say that “big data is dead” while Google Now or Siri is telling you that you need to leave 20 minutes early for an appointment because of traffic. And it’s easy to say that “big data is dead” while you’re using Google, or Bing, or DuckDuckGo to find material to help you write an article claiming that big data is dead.

Big data isn’t dead, though I only use the word “big” under duress. It’s just data. There’s more of it around than there used to be; we have better tools to generate, capture, and store it. As I argued in the beginning of 2013, the mere existence of data will drive the exploration and analysis of data. There’s no reason to believe this will stop.

That said, let’s look at one particular point from the Times op-ed: successful data analysis depends critically on asking the right question. It’s not so much a matter of “garbage in, garbage out” as it is “ask the wrong question, you get the wrong answer.” And here, the author of the Times piece is at least as uncritical as the data scientists he’s criticizing. He criticizes Steven Skiena and Charles Ward, authors of Who is Bigger, along with MIT’s Pantheon project, for the claims that Francis Scott Key was the 19th most important poet in history, that Jane Austen was only the 78th most important writer, and that George Eliot was the 380th.

Of course, this hinges on the meaning of “important.” If “important” means “central to the musical or literary canon,” then yes, the data-driven results are nonsense. But I wouldn’t expect data analysis to give me the same results I could get by talking to musicologists or literature professors. If by “important” we mean that the works somehow drove historical events, I would expect the author of “The Star-Spangled Banner” (to say nothing of the author of “La Marseillaise”) to outrank Keats. People don’t fight wars citing Keats’s “Ode on a Grecian Urn.”

The Pantheon project doesn’t use the word “important”; it measures global historical popularity, which is something quite different. And its results just aren’t very surprising. It is easy to forget how many authors there are; coming in 78th is not a bad showing when you’re competing with Homer, Shakespeare, and Dante. I am certainly not in a position to debate whether Austen is more or less popular than the 17th-century Japanese author Basho (52) or, for that matter, Nostradamus (20).

What do we mean by importance? What do we mean by influence? What do we mean by popularity? These are the sorts of questions you have to ask before doing any data analysis. I haven’t read Who is Bigger, but the Pantheon site does an excellent job of discussing its methodology, biases and limitations. And it provides an excellent foundation for a more important, nuanced discussion of popularity, influence, and importance.

There is a lot of hype about “big data,” and much of it is ridiculous. Ignore the hype. Learn to be a data skeptic. That doesn’t mean becoming skeptical about the value of data; it means asking the hard questions that anyone claiming to be a data scientist should ask. Think carefully about the questions you’re asking, the data you have to work with, and the results that you’re getting. And learn that data is about enabling intelligent discussions, not about turning a crank and having the right answer pop out.

Data is data. It was valuable 50 years ago, when IBM announced the System/360. It’s more valuable today.
