The dangers of data-driven list-making

Such lists might mean we miss the truly great breakthroughs, inspirations, and leaps of faith necessary to evolve.

Editor’s note: this post originally appeared on Tilt the Windmill; it is republished here with permission.

Startupfest’s Pamela Perotti asked for my thoughts on this great Forbes piece by Lightspeed’s Barry Eggers about using big data to build top ten lists that actually matter.

First: it’s an excellent post. You should read it. I’ll wait.

Every enterprise decision-maker will soon be running their business according to the lists Barry envisions, as the power of big data and analytics finds its way into every boardroom and dashboard. Society will soon demand them, too. But while such analysis is tremendously valuable, it carries two dangers: the politics of setting criteria, and the trap of relying on data for inspiration.

The harsh light of data

Barry is right: rather than using our precious time and resources to make yet another linkbait list of the 50 cutest kittens, or the seven people I’ll try to avoid at SXSW, we should use abundant data and a connected world to build lists that matter: lying politicians, bad cars, lousy doctors. Then we can use these lists to change policy and behaviour because we’ll make things transparent. Shining the harsh light of data on something can improve it.

Unfortunately, expecting big data to be a panacea that cures all our ills is overreaching and can lead to the kind of hype that scuttles otherwise ascendant technologies.

The unquestionable truth is that we optimize better with data. We can indeed build good lists. Assuming the criteria we choose are tied to outcomes we want, and reasonably objective — politicians’ lies, cars’ maintenance costs, doctors’ effectiveness — that’s a good thing.

What’s even better, and what Eggers’ article overlooks, is that because the lists are generated by software, they can be tailored. Each of us can have our own list because the cost of producing a list is nearly zero. Rather than just a list of safe cars, what about the safest car for Pamela, based on her driving record, weight, and the climate around her house? A tailored list is far better than an average one, and data makes it possible to customize lists so they’re as good as possible.
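
To make the tailoring idea concrete, here is a minimal sketch of how one data set can yield a different list per reader. Everything in it is an assumption made for illustration: the car names, the scores, the `rank` helper, and the weights are invented, and real safety data would be far messier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    name: str
    crash_safety: float      # 0..1, higher is better (hypothetical score)
    winter_handling: float   # 0..1, higher is better (hypothetical score)
    small_driver_fit: float  # 0..1, higher is better (hypothetical score)


# Invented data, for illustration only.
CARS = [
    Car("Sedan A", crash_safety=0.9, winter_handling=0.4, small_driver_fit=0.5),
    Car("Wagon B", crash_safety=0.7, winter_handling=0.9, small_driver_fit=0.8),
    Car("Coupe C", crash_safety=0.8, winter_handling=0.3, small_driver_fit=0.6),
]


def rank(cars, weights):
    """Return car names ordered by a weighted sum of their scores."""
    def score(car):
        return sum(w * getattr(car, attr) for attr, w in weights.items())
    return [car.name for car in sorted(cars, key=score, reverse=True)]


# The "average" list only looks at crash safety.
average_weights = {"crash_safety": 1.0, "winter_handling": 0.0, "small_driver_fit": 0.0}
# A reader in a snowy climate who cares about driver fit weights things differently.
snowy_weights = {"crash_safety": 1.0, "winter_handling": 1.0, "small_driver_fit": 0.5}

generic = rank(CARS, average_weights)
tailored = rank(CARS, snowy_weights)
print(generic)
print(tailored)
```

With these made-up numbers the generic list puts Sedan A first, while the tailored list puts Wagon B first. Note that the weights are themselves a choice someone has to make.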

(That’s also one of the big reasons we’re willing to disclose some things about ourselves: we get better results. A drop in privacy is an increase in utility.)

There are two big downsides

The first big downside to data-driven list-making is the question of who gets to set the criteria for a list. A list of the worst foods might include bacon and foie gras, but there are plenty of folks who’d rank those particular dishes among the best because they care about taste rather than health or ethics. Walking home might be better for your health, but it means you have a half hour less to play with your children. Everything’s a tradeoff, and someone has to decide what the right criteria are.

This is policy-making at its finest. And it’s never easy. Just try to get 10 people to agree on what constitutes a “good” politician when five of them support the NRA and five support Obamacare, and you’ll see how quickly this kind of criteria-setting devolves into a nanny-state argument for anything that’s even slightly subjective.

The second downside is an over-reliance on data. Any time you try to optimize something, you run into a problem known as a local maximum. You’re using data and algorithms to do the best thing you can within your current model. But the scope of your model might be wrong.

Imagine, for example, a lemonade stand. You use data to optimize everything — cost of lemons, pricing, where to set up, and so on. You’re making a ton of profit.

But perhaps you could make way more profit by selling iced tea. Because you framed your business as a “lemonade stand” and not a “refreshment provider,” you missed the opportunity to get to an even better position — what’s called a global maximum.
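
The lemonade stand can be sketched as a toy hill-climbing optimizer. This is an illustration under stated assumptions, not a model of any real business: the demand curves in `PRODUCTS`, the `profit` function, and the starting price are all invented. The optimizer only ever adjusts price within the product it was handed, which is exactly the framing problem.

```python
# Hypothetical products: (unit cost in cents, demand at price 0, demand lost per cent).
PRODUCTS = {
    "lemonade": (50, 200, 1),
    "iced tea": (30, 300, 1),
}


def profit(product, price_cents):
    """Daily profit in cents under an invented linear demand curve."""
    unit_cost, base_demand, slope = PRODUCTS[product]
    demand = max(0, base_demand - slope * price_cents)
    return (price_cents - unit_cost) * demand


def hill_climb(product, price_cents, step=5):
    """Nudge the price up or down until neither neighbor improves profit."""
    while True:
        best = max(
            (price_cents - step, price_cents, price_cents + step),
            key=lambda p: profit(product, p),
        )
        if best == price_cents:
            return price_cents
        price_cents = best


# Optimizing inside the "lemonade stand" frame finds the best lemonade price...
lemonade_price = hill_climb("lemonade", 100)
lemonade_profit = profit("lemonade", lemonade_price)

# ...but only someone who reframes the business thinks to run this line at all.
tea_price = hill_climb("iced tea", 100)
tea_profit = profit("iced tea", tea_price)

print(lemonade_price, lemonade_profit)
print(tea_price, tea_profit)
```

In this toy setup the optimizer does its job perfectly for lemonade and still leaves most of the available profit on the table, because no amount of price tuning ever proposes a different product.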

Of course, maybe you should be a refreshment truck or a sandwich shop. There’s no end to the scope you might choose, and you have to constrain things somehow, otherwise you have no business model. Framing is necessary because it lets you focus on a business, rather than running madly in all directions. But a tight focus on the current business model means that surrounding opportunities vanish into the periphery.

Put another way, we sometimes mistake optimization for inspiration. Data is for optimization; humans are for inspiration. Expecting the former to give you the latter is a bad thing.

Local maxima kill companies

Missing a global maximum — more specifically, failure to frame (and reframe) the business — is what ends big, incumbent companies. Think about Kodak, which pioneered film but missed digital cameras and smartphones. They failed to reframe their business as “sharing memories.” When you think of a picture that way, through modern eyes, an Internet connection is a pretty obvious feature of a picture-taking device.

Want another example? Blockbuster ran afoul of the local maximum when it thought it was in the video store business and got addicted to late fees as a source of revenue. No amount of optimization through data would have told Blockbuster, “get less money from your customers by mailing them DVDs with no return date.”

Framing problems are obvious in hindsight. A decade ago we didn’t have a smartphone in every pocket, and Kodak’s demise wasn’t clear to everyone. When one of BlackBerry’s CEOs wondered why anyone would want a camera on their phone, he was guilty of the same thing.

“…it was the ‘candy bar’ format, and it had a track wheel, and it had really good connectivity. It was really nice for scrolling around, and it could play video, and it had a camera. Up until that point, Mike (Lazaridis) had said, ‘That’s crazy, why would I ever want a camera?’ All of a sudden, BlackBerry becomes a consumer play.” (from Micheal A. Levin)

Are lists of cars and politicians the best way to fix driving and politics?

Back to the examples presented in the article. Perhaps the best, safest car is a self-driving car. Studies suggest that this is the case. But early on, such cars need a lot of maintenance (perhaps because there aren’t enough of them made to work out the bugs). They’re also expensive (because the manufacturers can’t amortize the cost of invention across many vehicles). So year after year, the self-driving cars come in less reliable and more expensive on the list. If people listen to the list, the cars don’t get bought. We’re stuck with dangerous, human-driven vehicles. Innovation grinds to a halt.

For another example from the Forbes piece, consider lying politicians. We might see a ranking of dishonest members of Congress, but the algorithm or the data is unlikely to step back and say, “maybe representative government is a hack. Maybe lobbying is literally misrepresentation. Maybe in an era of Facebook, we don’t need representatives, and instead we should use digital voting and a direct democracy.”

In other words, data-driven optimization is great for doing the best at the game we’re currently playing; it’s awful for changing the rules or switching to a different game.

I love the idea of big data helping us better understand and optimize the world around us. I do think it gives an uncomfortable amount of power to those who create the tools that make the lists, and I question whether such lists will mean we miss the truly great breakthroughs, inspirations, and leaps of faith through which we evolve as a species.




  • Sam Penrose

    Lying politicians is a wonderful example precisely because it highlights foolishness in smarty-pants commentators. Politicians represent millions of people who disagree with each other about, for example, whether the earth is 6,000 or 13.8 billion years old, and sometimes worse, who agree with each other (in 1940) that America should not send its sons to die in Europe’s squabbles. Thank goodness for lying mealy-mouthed politicians who understand that leadership is much too important to squander on Asperger’s-y notions of the ethics of public communication.

  • Doug K

    I like your analysis, except for the part that says the Forbes article is excellent. It’s vacuous. We expect this kind of bloviating from Forbes, but it is dispiriting to find Radar endorsing it.

    The Forbes piece is a classic bit of ‘big-picture guy’ ‘vision thing’ ‘pointy-haired-boss’ thinking. The list might as well include
    4. a pony
    Apparently the writer is so far above the messy details of implementation as to be unable to discern even the outline of the problem. This isn’t vision, it’s blindness.

    To particulars:
    1. lying politicians.
    Implementing the fact checking requires solving the problem of machine understanding of natural language; parsing what you mean by ‘is’. Big data can’t solve that. In any case as Sam notes, honesty is only one trait of politicians and not necessarily the most important – I don’t want a politician who honestly does not believe in anthropogenic global warming, I want a politician capable of rational thought, who faces reality and deals with it as best s/he can.

    2. the best and worst doctors.
    The data on doctors is held very closely, as it is part of their competitive advantage in our healthcare industry. Simply to assemble this data would require breaking and reforming the entire industry as single-payer. As it is we can’t even start getting the data, never mind analyzing it. Once we have the data, again there is the problem of getting machine learning to outperform human diagnosis – how do we ascertain ‘the accuracy of diagnosis’ otherwise? Teams of experts, reviewing every diagnosis? That doesn’t scale.

    3. reliable cars
    Consumer Reports is dismissed as ‘so subjective that they lack any true information utility.’
    CR has been doing big data for decades now – collecting data on reliability and subjecting it to statistical analysis. As such this problem is solved already, and the lists are available on CR’s website. In any case this is utterly irrelevant to Detroit or saving it. Detroit already knows its cars are unreliable, and has done for many years now. Changing that is a bigger and harder problem.

  • mitch696969

    Again this is very dangerous ground to use lists to support bigotry to change behavior and should be largely illegal. Priceline is a different example of a false money saving guide and should also be illegal because it is a false truth. We need to demand inventors understand what real transparency is. Real transparency is revealing what you are collecting and revealing what you are doing with it. Building hardware, software that makes this difficult to see is at the core of today’s problems. If people only realized what evil was perpetrated with their data they just might get a lot less accommodating…. This is a very good thing to get some better morals in the industry. We should also demand that carriers of sensitive information carry more extensive responsibility for information leaks, thus make information a little less attractive to bank in memory. The whole Internet is a mess in its current condition. They are already making the next huge blunder which is to let automation blend with existing systems without separating the two thoroughly. Soon we will find ourselves competing with automated objects in cyberspace because that’s how hackers in the industry wanted it. The time for domain-based information delivery that is much more robust in security protocols is long overdue because the private sector isn’t interested.

  • mitch696969

    A list can never be anything more than an opinion by Information Technology standards and opinions can be problematic.