The vanishing cost of guessing

As society becomes increasingly data-driven, it's critical to remember that big data isn't a magical tool for predicting the future.

If you eat ice cream, you’re more likely to drown.

That’s not true, of course. It’s just that both ice cream and swimming happen in the summer. The two are correlated — and ice cream consumption is a good predictor of drowning fatalities — but ice cream hardly causes drowning.
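To make the ice cream example concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the temperature range, the coefficients, the noise levels, and the helper names `pearson` and `residuals` are all assumptions, not real data. Temperature drives both ice cream sales and drownings, and the two never influence each other directly. The raw correlation comes out strong, but once the effect of temperature is subtracted from both series, what is left is close to zero.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What remains of ys after removing a least-squares line fitted on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical daily data: temperature is the hidden common cause.
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 3) for t in temperature]

# Ice cream "predicts" drowning when temperature is ignored.
raw = pearson(ice_cream, drownings)

# Controlling for temperature: correlate what temperature does not explain.
controlled = pearson(residuals(ice_cream, temperature),
                     residuals(drownings, temperature))

print(f"raw correlation:        {raw:.2f}")
print(f"after removing summer:  {controlled:.2f}")
```

The catch is that this only works because we knew to measure temperature. In a real data set the common cause is often something nobody recorded.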

These kinds of correlations are all around us, and big data makes them easy to find. We can correlate childhood trauma with obesity, nutrition with crime rates, and how toddlers play with future political affiliations.
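A shared cause like summer is one way to get a misleading correlation. Sheer volume is another: compare enough variables and some pairs will line up by accident. The sketch below is an illustration under made-up assumptions (200 series of 20 points each, all pure random noise, with a hypothetical `pearson` helper), not a description of any particular tool. It searches every pair of series and reports the strongest correlation it finds.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(1)  # fixed seed so the sketch is repeatable

NUM_SERIES = 200  # assumed number of unrelated "metrics"
NUM_POINTS = 20   # assumed number of observations per metric

# Every series is independent noise, so no pair has any real relationship.
series = [[rng.gauss(0, 1) for _ in range(NUM_POINTS)]
          for _ in range(NUM_SERIES)]

# Exhaustively compare all pairs, the kind of search cheap computing invites.
pair_count = NUM_SERIES * (NUM_SERIES - 1) // 2
strongest = max(abs(pearson(a, b)) for a, b in combinations(series, 2))

print(f"pairs compared: {pair_count}")
print(f"strongest correlation found in pure noise: {strongest:.2f}")
```

With nearly twenty thousand pairs to choose from, the best match looks impressive even though the data contains nothing but noise. The cheaper the search, the more of these coincidences turn up.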

Just as we wouldn’t ban ice cream in the hopes of preventing drowning, we wouldn’t preemptively arrest someone because their diet wasn’t healthy. But a quantified society, awash in data, might be tempted to do so because overwhelming correlation looks a lot like causality. And overwhelming correlation is what big data does best.

It’s getting easier than ever to find correlations. Parallel computing, advances in algorithms, and the inexorable crawl of Moore’s Law have dramatically reduced how much it costs to analyze a data set. Consider an activity we do dozens of times a day, without thinking: a Google search. The search is farmed out to thousands of machines, and often returns hundreds of answers in less than a second. Big data might seem esoteric, but it’s already here.

Google’s search results aren’t the right results; they’re those that are most likely to be related to what you searched for. Similarly, Watson, IBM’s Jeopardy-winning software, mined millions of records to guess at the right answer. Today, an abundance of cheap, simple tools makes it trivial for organizations to guess rather than to know about everything from employee honesty to the spread of disease to the optimal delivery of car parts in a snow-bound city to whether a teenager is pregnant.

Tomorrow’s data-driven society is both smarter and dumber, more just and more merciless. The ethical implications of this shift are only now becoming clear: at some point, innocent-until-proven-guilty looks a lot like innocent-until-likely-to-be-guilty.

What the big data revolution is really about is predicting the future. Whether it’s choosing the right ad to show a web visitor, or setting the optimal insurance premium, or helping an inner-city student learn better, we crunch reams of data to try to predict what will happen.

Proponents see this as a boon to humanity. Big data makes us smart: we can anticipate a flu outbreak or where charitable donations do the most good. It also makes us just: transparent, open information and the tools to analyze it shine the harsh light of data on corruption, replacing opinions with facts.

On the other hand, critics charge that big data will make us stick to constantly optimizing what we already know, rather than thinking outside the box and truly innovating. We’ll rely on machines for evolutionary improvements, rather than revolutionary disruption. An abundance of data means we can find facts to support our preconceived notions, polarizing us politically and dividing us into “filter bubbles” of like-minded intolerance. And it’s easy to mistake correlation for causality, leading us to deny someone medical coverage or refuse them employment because of a pattern over which they have no control, taking us back to the racism and injustice of apartheid or redlining.

Big data isn’t a magical tool for predicting the future. It’s not a way to peer into someone’s soul or decide what’s going to happen, even though it’s often frighteningly good at guessing. Just because the cost of guessing is dropping quickly to zero doesn’t mean we should treat a guess as the truth. As we become an increasingly data-driven society, it’s critical that we remember we can no more predict tomorrow with today’s data than we can prevent drowning by banning ice cream.


  • raghuhavaldar

    As with anything, there will be certain classes of problems that may not be well suited to Big Data, but it will get us much further with many of our challenges.

  • jakeporway

    Great post. In my mind this article highlights the need for greater public statistical literacy. Problems of misinterpreting data aren’t new. The foundations of experimental design in statistics were rooted in a quest to take the guesswork out of jumping from correlation to causation. The ways data can be used and abused have been long known (see How to Lie With Statistics for a great digestible account), it’s just that the explosion in available data and data tools means that there’s a whole new slew of untrained citizens who now have the ability to commit those sins unwittingly. Now that we’re getting past some of the engineering hurdles of Big Data, we’re finding ourselves faced with the new familiar problem of analyzing data, a.k.a. applied statistics. We should remind ourselves of what we already know :)

  • Scott Berkun

    Alistair: What recommendations do you have for how, in a meeting of people with full faith in big data, to politely ground decisions in the wisdom you’re offering here?

    Wishful thinking is far more powerful than logic, even among smart and wise people. Until the trend passes with whatever follows, many people are blinded by data faith.


  • GeorgePR

    Big data is anything but smart, and it does not tell the future; it just follows “patterns” and eventually produces a big regression to the dumbest. Newton was not in any pattern, and neither was Einstein or any of the geniuses who gave us the stupid world we are now living in. The big data techniques are so skewed and biased that Sir Fisher is probably turning in his grave.

  • shparekh

    The article says it’s easier these days to mine data. Where are these data sources, and how can I access them? Say I would like to understand the CO2 levels in urban areas at various times of the day. How do I go about accessing this data? Any specific sites, search engines, etc.? Google is so overwhelming that it is not productive.

  • Reto Matter

    Congrats, Alistair, this is a great article! I think this is one of the key problems we have to cope with when dealing with the advances in Big Data and Predictive Analytics. It is very tempting to view the results you get from mining and analyzing all these big data sets as reality, which of course it is not at all. I can see a dangerous trend of reversing the “burden of proof”. For instance, health insurance companies could stop viewing people as healthy or ill but rather start viewing them all as ill to a certain percentage. In the end, that is how risk is modeled, right? Imagine the equivalent trend in law, and Philip K. Dick’s Minority Report won’t be that far away…

  • Alistair, a smart article, but do your concluding words really express your thoughts, that “we can no more predict tomorrow with today’s data than we can prevent drowning by banning ice cream”? That statement is simply and demonstrably incorrect. Based on data available today, we know what minute the sun’s going to come up tomorrow and what the weather is highly likely to be, within a few degrees. We predict myriad things about tomorrow. Perhaps you should rephrase?