The vanishing cost of guessing

If you eat ice cream, you’re more likely to drown.

That’s not true, of course. Ice cream sales and swimming both peak in the summer, so the two are correlated, and ice cream consumption is a good predictor of drowning fatalities. But ice cream hardly causes drowning.
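To make the hidden-cause idea concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the temperature range, the coefficients, and the noise levels are assumptions, not real sales or fatality figures. It simulates days on which warm weather drives both ice cream sales and drownings, then compares the raw correlation with the correlation that remains once temperature is accounted for.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: temperature is the hidden common cause.
temperature = [rng.uniform(0, 35) for _ in range(5000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 2) for t in temperature]

# Raw correlation: ice cream looks like a good predictor of drowning.
raw = pearson(ice_cream, drownings)

# Partial correlation: remove the effect of temperature from both
# series, then correlate what remains.
partial = pearson(residuals(ice_cream, temperature),
                  residuals(drownings, temperature))

print(f"raw correlation:           {raw:.2f}")
print(f"controlling for the heat:  {partial:.2f}")
```

In this toy setup the raw correlation comes out strongly positive, while the correlation after controlling for temperature sits near zero. The catch is that the sketch only works because we already knew to measure temperature; in real data the common cause is often something nobody recorded.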

These kinds of correlations are all around us, and big data makes them easy to find. We can correlate childhood trauma with obesity, nutrition with crime rates, and how toddlers play with future political affiliations.

Just as we wouldn’t ban ice cream in the hopes of preventing drowning, we wouldn’t preemptively arrest someone because their diet wasn’t healthy. But a quantified society, awash in data, might be tempted to do so because overwhelming correlation looks a lot like causality. And overwhelming correlation is what big data does best.

It’s getting easier than ever to find correlations. Parallel computing, advances in algorithms, and the inexorable crawl of Moore’s Law have dramatically reduced how much it costs to analyze a data set. Consider an activity we do dozens of times a day, without thinking: a Google search. The search is farmed out to thousands of machines, and often returns hundreds of answers in less than a second. Big data might seem esoteric, but it’s already here.

Google’s search results aren’t the right results; they’re those that are most likely to be related to what you searched for. Similarly, Watson, IBM’s Jeopardy-winning software, mined millions of records to guess at the right answer. Today, an abundance of cheap, simple tools makes it trivial for organizations to guess rather than to know about everything from employee honesty to the spread of disease to the optimal delivery of car parts in a snow-bound city to whether a teenager is pregnant.

Tomorrow’s data-driven society is both smarter and dumber, more just and more merciless. The ethical implications of this shift are only now becoming clear: at some point, innocent-until-proven-guilty looks a lot like innocent-until-likely-to-be-guilty.

What the big data revolution is really about is predicting the future. Whether it’s choosing the right ad to show a web visitor, or setting the optimal insurance premium, or helping an inner-city student learn better, we crunch reams of data to try to predict what will happen.

Proponents see this as a boon to humanity. Big data makes us smart: we can anticipate a flu outbreak or where charitable donations do the most good. It also makes us just: transparent, open information and the tools to analyze it shine the harsh light of data on corruption, replacing opinions with facts.

On the other hand, critics charge that big data will make us stick to constantly optimizing what we already know, rather than thinking outside the box and truly innovating. We’ll rely on machines for evolutionary improvements, rather than revolutionary disruption. An abundance of data means we can find facts to support our preconceived notions, polarizing us politically and dividing us into “filter bubbles” of like-minded intolerance. And it’s easy to mistake correlation for causality, leading us to deny someone medical coverage or refuse them employment because of a pattern over which they have no control, taking us back to the racism and injustice of apartheid or redlining.

Big data isn’t a magical tool for predicting the future. It’s not a way to peer into someone’s soul or decide what’s going to happen, even though it’s often frighteningly good at guessing. Just because the cost of guessing is dropping quickly to zero doesn’t mean we should treat a guess as the truth. As we become an increasingly data-driven society, it’s critical that we remember we can no more predict tomorrow with today’s data than we can prevent drowning by banning ice cream.

