Untangling algorithmic illusions from reality in big data

Microsoft principal researcher Kate Crawford (@katecrawford) gave a strong talk at last week’s Strata Conference in Santa Clara, Calif. about the limits of big data. She pointed out potential biases in data collection, questioned who may be excluded from it, and hammered home the constant need for context in conclusions.

Crawford explored many of these same topics in our interview, which follows.

What research are you working on now, following up on your paper on big data?

Kate Crawford: I’m currently researching how big data practices are affecting different industries, from news to crisis recovery to urban design. This talk was based on that upcoming work, touching on questions of smartphones as sensors, on dealing with disasters (like Hurricane Sandy), and new epistemologies — or ways we understand knowledge — in an era of big data.

When “Six Provocations for Big Data” came out in 2011, we were critiquing the very early stages of big data and social media. In the two years since, the issues we raised are even more prominent.

I’m now looking beyond social media to a range of other areas where big data is raising questions of social justice and privacy. I’m also editing a special issue on critiques of big data, which will be coming out later this year in the International Journal of Communication.

As more nonprofits and governments look to data analysis in governing or services, what do they need to think about and avoid?

Kate Crawford: Governments have a responsibility to serve all citizens, so it’s important that big data doesn’t become a proxy for “data about everyone.” There are two problems here: first is the question of who is visible and who isn’t represented; the second is privacy, or what I call “privacy practices” — because privacy means different things depending on where and who you are.

For example, the Street Bump app is brilliant. What city wouldn’t want to passively draw on data from all those smartphones out there, a constantly moving network of sensors? But, as we know, there are significant percentages of Americans who don’t have smartphones, particularly older citizens and those with lower disposable incomes. What happens to their neighborhoods if they generate no data? They fall off the map. To be invisible when governments make resource decisions is dangerous.

Then, of course, there’s the whole issue of people signing up to be passively tracked wherever they go. People may happily opt into it, but we’d want to be very careful about who gets that data, and how it is protected over the long term — not just five years, but 50 years and beyond. Governments might be tempted to use that data for other purposes, even civic ones, and this has significant implications for privacy and the expectations citizens have for the use of their data.

Where else could such biases apply?

Kate Crawford: There are many areas where big data bias is a problem from a social equity perspective. One of the key ones at the moment is law enforcement. I’m concerned by some of the work that seeks to “profile” areas, and even people, as likely to be involved in crime. It’s called “predictive policing.” We’ve already seen some problematic outcomes when profiling was introduced for plane travel. Now, imagine what happens if you or your neighborhood falls on the wrong side of a predictive model. How do you even begin to correct the record? Which algorithm do you appeal to?

What are the things, as David Brooks listed recently, that big data can’t do?

Kate Crawford: There are lots of things that big data can’t do. It’s useful to consider the history of knowledge, and then imagine what it would look like if we only used one set of tools, one methodology for getting answers.

This is why I find people like Gabriel Tarde so interesting — he was grappling with ideas of method, big data and small data, back in the late 1800s.

He reminds us of what we can lose sight of when we go up orders of magnitude and try to leave small-scale data behind — like interviewing people, or observing communities, or running limited experiments. Context is key, and it is much easier to be attentive to context when we are surrounded by it. When context is dissolved into so many aggregated datasets, we can start getting mistaken impressions.

When Google Flu Trends mistakenly predicted that 11% of the US had flu this year, it showed how relying on a big data signal alone may give us an exaggerated or distorted result (in that case, more than double the actual figure, which was between 4.5% and 4.8%). Now, imagine how much worse it would be if that data was all that health agencies had to work with.

I’m really interested in how we might best combine computational social science with traditional qualitative and ethnographic methods. With a range of tools and perspectives, we’re much more likely to get a three-dimensional view of a problem and be less prone to serious error. This goes beyond tacking a few focus groups onto big datasets; it means conjoining deep, ethnographically informed research with rich data sources.

What can the history of statistics in social science tell us about correlation vs causation? Does big data change that dynamic?

Kate Crawford: This is a gigantic question, and one that could be its own talk! With big datasets, it’s very tempting for researchers to engage in apophenia — seeing patterns where none actually exist — because massive quantities of data can point to a range of correlative possibilities.

For example, David Leinweber showed back in 2007 that data mining techniques could show a strong but spurious correlation between the changes in the S&P 500 stock index and butter production in Bangladesh. There’s another great correlation between the use of Facebook and the rise of the Greek debt crisis.
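
To make the point about spurious correlation concrete, here is a minimal Python sketch. It is not Leinweber's analysis and uses no real market or butter data. It assumes a short target series of pure random noise and searches thousands of equally random candidate series for the one that correlates best with it. The function names (`pearson`, `random_series`, `best_spurious_match`) and the parameters (5,000 candidates, 10 data points, seed 0) are illustrative choices, not anything from the interview.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_series(rng, length):
    """A series of independent Gaussian noise values (no real signal)."""
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


def best_spurious_match(n_candidates=5000, length=10, seed=0):
    """Return the strongest correlation found between one random target
    series and n_candidates unrelated random series.

    All numbers here are hypothetical; the only point is that searching
    many unrelated series tends to turn up a strong-looking match.
    """
    rng = random.Random(seed)
    target = random_series(rng, length)
    best_r = 0.0
    for _ in range(n_candidates):
        r = pearson(target, random_series(rng, length))
        if abs(r) > abs(best_r):
            best_r = r
    return best_r


if __name__ == "__main__":
    print(f"one random candidate:   r = {best_spurious_match(n_candidates=1):.2f}")
    print(f"best of 5000 candidates: r = {best_spurious_match():.2f}")
```

Every series in this sketch is noise by construction, so any strong correlation it reports comes from the search itself. The more candidate series you test, the more impressive the best match tends to look.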

With big data techniques, some people argue you can get much closer to being able to predict causal relations. But even here, big data tends to need several steps of preparation (data “cleaning” and pre-processing) and several steps in interpretation (deciding which of many analyses shows a positive result versus a null result).

Basically, humans are still in the mix, and thus it’s very hard to escape false positives, strained correlations and cognitive bias.