The creep factor: How to think about big data and privacy

There was a great passage in Alexis Madrigal’s recent interview with Gibu Thomas, who runs innovation at Walmart:

“Our philosophy is pretty simple: When we use data, be transparent to the customers so that they can know what’s going on. There’s a clear opt-out mechanism. And, more important, the value equation has to be there. If we save them money or remind them of something they might need, no one says, “Wait, how did you get that data?” or “Why are you using that data?” They say, “Thank you!” I think we all know where the creep factor comes in, intuitively. Do unto others as you want to be done to you, right?”

This notion of “the creep factor” seems fairly central as we think about the future of privacy regulation. When companies use our data for our benefit, we know it and we are grateful for it. We happily give up our location data to Google so they can give us directions, or to Yelp or Foursquare so they can help us find the best place to eat nearby. We don’t even mind when they keep that data if it helps them make better recommendations in future. Sure, Google, I’d love it if you can do a better job predicting how long it will take me to get to work at rush hour! And yes, I don’t mind that you are using my search and browsing habits to give me better search results. In fact, I’d complain if someone took away that data and I suddenly found that my search results just weren’t as good as they used to be!

But we also know when companies use our data against us, or sell it on to people who do not have our best interests in mind.

When credit was denied not because of your ability to pay but because of where you lived or your racial identity, that was called “redlining,” so called because of the practice of drawing a red line on the map to demarcate geographies where loans or insurance would be denied or made more costly. Well, there’s a new kind of redlining in the 21st century. The Atlantic calls it data redlining:

“When a consumer applies for automobile or homeowner insurance or a credit card, companies will be able to make a pretty good guess as to the type of risk pool they should assign the consumer to. The higher-risk consumers will never be informed about or offered the best deals. Their choices will be limited.

“State Farm is currently offering a discount to customers through a program called Drive Safe & Save. The insurer offers discounts to customers who use services such as Ford’s Sync or General Motors’ OnStar, which, among other things, read your odometer remotely so that customers no longer have to fuss with tracking how many miles they drive to earn insurer discounts. How convenient!”

“State Farm makes it seem that it’s only your mileage that matters but imagine the potential for the company once it has remote access to your car. It will know how fast you drive on the freeway even if you don’t get a ticket. It will know when and where you drive. What if you drive on routes where there are frequent accidents? Or what if you park your car in high-crime areas?”

In some ways, the worst case scenario in the last paragraph above is tinfoil hat stuff. There is no indication that State Farm Insurance is actually doing those things, but we can see from that example where the boundaries of fair use and analysis might lie. It seems to me that insurance companies are quite within their rights to offer lower rates to people who agree to drive responsibly, and to verify the consumer’s claims of how many miles they drive annually, but if my insurance rates suddenly spike because of data about formerly private legal behavior, like the risk profile of where I work or drive for personal reasons, I have reason to feel that my data is being used unfairly against me.

Similarly, if I don’t have equal access to the best prices on an online site, because the site has determined that I have either the capacity or willingness to pay more, my data is being used unfairly against me.

The right way to deal with data redlining is not to prohibit the collection of data, as so many misguided privacy advocates seem to urge, but rather, to prohibit its misuse once companies have that data. As David Brin, author of the prescient 1998 book on privacy, The Transparent Society, noted in a conversation with me last night, “It is intrinsically impossible to know if someone does not have information about you. It is much easier to tell if they do something to you.”

Furthermore, because data is so useful in personalizing services for our benefit, any attempt to prohibit its collection will quickly be outrun by consumer preference, much as the Germans simply routed around France’s famed Maginot Line at the outset of World War II. For example, we are often asked today by apps on our phone if it’s OK to use our location. Most of the time, we just say “yes,” because if we don’t, the app just won’t work. Being asked is an important step, but how many of us actually understand what is being done with the data that we have agreed to surrender?

The right way to deal with data redlining is to think about the possible harms to the people whose data is being collected, and primarily to regulate those harms, rather than the collection of the data itself, which can also be put to powerful use for those same people’s benefit. When people were denied health coverage because of pre-existing conditions, that was their data being used against them; this is now restricted by the Affordable Care Act. By contrast, the privacy rules in HIPAA, the 1996 Health Information Portability and Accountability Act, which seek to set overly strong safeguards around the privacy of data, rather than its use, have had a chilling effect on many kinds of medical research, as well as patients’ access to their very own data!

Another approach is shown by legal regimes such as that controlling insider trading: once you have certain data, you are subject to new rules, rules that may actually encourage you to avoid gathering certain kinds of data. If you have material nonpublic data obtained from insiders, you can’t trade on that knowledge, while knowledge gained by public means is fair game.

I know there are many difficult corner cases to think through. But the notion of whether data is being used for the benefit of the customer who provided it (either explicitly, or implicitly through his or her behavior), or is being used against the customer’s interests by the party that collected it, provides a pretty good test of whether or not we should consider that collecting party to be “a creep.”

For more information on big data and privacy — and to get involved in the conversation — subscribe to the free Data Newsletter.

The creep factor: How to think about big data and privacy

Get the O’Reilly Data Newsletter