The creep factor: How to think about big data and privacy

There was a great passage in Alexis Madrigal’s recent interview with Gibu Thomas, who runs innovation at Walmart:

“Our philosophy is pretty simple: When we use data, be transparent to the customers so that they can know what’s going on. There’s a clear opt-out mechanism. And, more important, the value equation has to be there. If we save them money or remind them of something they might need, no one says, ‘Wait, how did you get that data?’ or ‘Why are you using that data?’ They say, ‘Thank you!’ I think we all know where the creep factor comes in, intuitively. Do unto others as you want to be done to you, right?”

This notion of “the creep factor” seems fairly central as we think about the future of privacy regulation. When companies use our data for our benefit, we know it and we are grateful for it. We happily give up our location data to Google so they can give us directions, or to Yelp or Foursquare so they can help us find the best place to eat nearby. We don’t even mind when they keep that data if it helps them make better recommendations in the future. Sure, Google, I’d love it if you can do a better job predicting how long it will take me to get to work at rush hour! And yes, I don’t mind that you are using my search and browsing habits to give me better search results. In fact, I’d complain if someone took away that data and I suddenly found that my search results just weren’t as good as they used to be!

But we also know when companies use our data against us, or sell it on to people who do not have our best interests in mind.

When credit was denied not because of your ability to pay but because of where you lived or your racial identity, that was called “redlining,” after the practice of drawing a red line on a map to demarcate neighborhoods where loans or insurance would be denied or made more costly. Well, there’s a new kind of redlining in the 21st century. The Atlantic calls it data redlining:

“When a consumer applies for automobile or homeowner insurance or a credit card, companies will be able to make a pretty good guess as to the type of risk pool they should assign the consumer to. The higher-risk consumers will never be informed about or offered the best deals. Their choices will be limited.

“State Farm is currently offering a discount to customers through a program called Drive Safe & Save. The insurer offers discounts to customers who use services such as Ford’s Sync or General Motors’ OnStar, which, among other things, read your odometer remotely so that customers no longer have to fuss with tracking how many miles they drive to earn insurer discounts. How convenient!

“State Farm makes it seem that it’s only your mileage that matters but imagine the potential for the company once it has remote access to your car. It will know how fast you drive on the freeway even if you don’t get a ticket. It will know when and where you drive. What if you drive on routes where there are frequent accidents? Or what if you park your car in high-crime areas?”

In some ways, the worst-case scenario in the last paragraph above is tinfoil-hat stuff. There is no indication that State Farm Insurance is actually doing those things, but we can see from that example where the boundaries of fair use and analysis might lie. It seems to me that insurance companies are quite within their rights to offer lower rates to people who agree to drive responsibly, and to verify a consumer’s claims about how many miles they drive annually, but if my insurance rates suddenly spike because of data about formerly private legal behavior, like the risk profile of where I work or where I drive for personal reasons, I have reason to feel that my data is being used unfairly against me.

Similarly, if I don’t have equal access to the best prices on an online site, because the site has determined that I have either the capacity or willingness to pay more, my data is being used unfairly against me.
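To make that concrete, here is a deliberately invented sketch (no real retailer’s logic; the signal names and multipliers are assumptions) of how a pricing engine could quietly fold a customer’s profile into the price they are quoted:

```python
# Invented illustration only: not any real site's pricing logic.
# A site that has profiled you can fold that profile into the price
# you see, without ever telling you it did so.

BASE_PRICE = 100.00

def quoted_price(profile: dict) -> float:
    """Return a price adjusted by inferred willingness to pay."""
    price = BASE_PRICE
    # Hypothetical signals inferred from browsing history or device type.
    if profile.get("device") == "high-end":
        price *= 1.10   # assumed to be less price-sensitive
    if profile.get("compared_prices_elsewhere"):
        price *= 0.95   # comparison shoppers get the better deal
    return round(price, 2)

print(quoted_price({"device": "high-end"}))               # 110.0
print(quoted_price({"compared_prices_elsewhere": True}))  # 95.0
```

The customer never sees which branch they were routed down, which is part of what makes this kind of use so hard to detect from the outside.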

The right way to deal with data redlining is not to prohibit the collection of data, as so many misguided privacy advocates seem to urge, but rather to prohibit its misuse once companies have that data. As David Brin, author of the prescient 1998 book on privacy, The Transparent Society, noted in a conversation with me last night, “It is intrinsically impossible to know if someone does not have information about you. It is much easier to tell if they do something to you.”

Furthermore, because data is so useful in personalizing services for our benefit, any attempt to prohibit its collection will quickly be outrun by consumer preference, much as the Germans simply routed around France’s famed Maginot Line at the outset of World War II. For example, we are often asked today by apps on our phone if it’s OK to use our location. Most of the time, we just say “yes,” because if we don’t, the app just won’t work. Being asked is an important step, but how many of us actually understand what is being done with the data that we have agreed to surrender?

The right way to deal with data redlining is to think about the possible harms to the people whose data is being collected, and primarily to regulate those harms, rather than the collection of the data itself, which can also be put to powerful use for those same people’s benefit. When people were denied health coverage because of pre-existing conditions, that was their data being used against them; this is now restricted by the Affordable Care Act. By contrast, the privacy rules in HIPAA, the 1996 Health Insurance Portability and Accountability Act, which seek to set overly strong safeguards around the privacy of data rather than its use, have had a chilling effect on many kinds of medical research, as well as on patients’ access to their very own data!

Another approach is shown by legal regimes such as the one governing insider trading: once you have certain data, you are subject to new rules, rules that may actually encourage you to avoid gathering certain kinds of data. If you have material nonpublic information obtained from insiders, you can’t trade on that knowledge, while knowledge gained by public means is fair game.

I know there are many difficult corner cases to think through. But the notion of whether data is being used for the benefit of the customer who provided it (either explicitly, or implicitly through his or her behavior), or is being used against the customer’s interests by the party that collected it, provides a pretty good test of whether or not we should consider that collecting party to be “a creep.”

For more information on big data and privacy — and to get involved in the conversation — subscribe to the free Data Newsletter.



  • John B

    Gotta vehemently disagree that blindly accepting requests for access to my devices defaults to ‘yes’ – quite the opposite is true.

    Also, how is the average user expected to be able to discern when they’ve been discriminated against? Most are unaware that the $5 coffeeshop card they asked for exposes their physical & online identities for marketing, as well as likely their preferred type of beverage, preferred coffeeshop, the times they make such purchases, etc., and that’s with nothing more malicious than a cash register at said coffeeshop.

    How SHOULD a State Farm insurance client know why their rates went up? Do you honestly expect State Farm (or any other business) to admit their creepiness instead of finding other rationalizations?

    • John,

      I think you’re totally right that people saying “yes” to apps’ requests to share their location data is a blank check for companies to do anything they want with that data. But people do say “yes,” so what to do? I am arguing that what we want to do is figure out specific harms that we want to prohibit or penalize, precisely because people will say “yes” so thoughtlessly, and because once that permission is given, it is very difficult to tell just what data companies have or how they are using it.

      Auditing “data harm” is difficult, but not more difficult than, say, Google managing search quality or combating search spam. It’s a good use case for what I’ve called “algorithmic regulation.”
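      One hedged sketch of what such an audit could look like (the function names below are invented, not a real auditing tool) is a paired test: query the system under audit with two profiles that differ in exactly one sensitive attribute and flag any gap in the quotes.

      ```python
      # Hypothetical "paired test" audit: all names here are invented.
      # Query the system under audit with two profiles that differ in
      # exactly one sensitive attribute and flag any divergence.

      def paired_test(quote_fn, base_profile, attribute, value_a, value_b,
                      tolerance=0.01):
          """Return True if the two quotes diverge beyond tolerance."""
          quote_a = quote_fn({**base_profile, attribute: value_a})
          quote_b = quote_fn({**base_profile, attribute: value_b})
          return abs(quote_a - quote_b) > tolerance

      # Toy stand-in for a pricing system that (unfairly) keys on ZIP code.
      def quote(profile):
          return 100.0 + (25.0 if profile.get("zip") == "10451" else 0.0)

      print(paired_test(quote, {"miles": 8000}, "zip", "10451", "10023"))  # True
      ```

      Scaling this up across many attributes and many probes is the kind of continuous, automated checking that “algorithmic regulation” implies.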

      • John B

        Fair point, I’d be very curious to see a list of those, ideally including reasonable actions which could allow parties external to the relationship to validate the existence and degree of said harms. I suspect it would take a very long time to achieve that in a society with anything remotely resembling current notions of privacy and freedom, however.

        Your examples in the second part seem incongruous to me. Google has market forces that drive it to improve search quality & reduce spam. I am unable to frame a similar market force that would help drive auditing of their potential data harm(s). This implies to me that regulation may be the external driving force, something Google and every other major business has great gobs of expertise in evading by meeting some minimal subset of its intent.

  • Another really provocative take on the political issues around big data comes from David Eaves, the Canadian open government activist. He writes about a different kind of data redlining: the data that isn’t collected in order to affect the operations of government.

    He points out, for example, that North Carolina has legislated what kind of climate data can be used to predict sea level rise. And he points out the role of the US census in shaping everything from congressional seats to budgeting:

    “As accessibility becomes less politicized, how governments collect data will become the new political battlefield. The most relevant “open” U.S. government data set may be the census. The grand history of disputes over its seemingly benign numbers—what questions to ask, what methodologies to use, what to do about the information—is emblematic of the bickering on the horizon. The census is so contentious because the stakes are so high: Its results determine seat counts in Congress, as well as how more than $400 billion in federal and state funds are allocated. Yet the numbers have long been plagued by inaccuracies. The 1990 census failed to include an estimated 8 million immigrants and urban minorities while double-counting roughly 4 million white Americans.

    “If one party is able to legislate how data are collected, subsequent battles may not matter—it can, in effect, create an “official” reality that serves a broader goal. The 1990 census became so problematic and partisan it ended up in the Supreme Court. A 1999 court ruling that statistical models and sampling could not be used to reapportion congressional seats is just a taste of what is to come as Big Data’s results come to influence even more decisions.”

  • Boris J.

    For me the “creep factor” is also on the other side of the desk: if you’re a collector of data and feel responsible for it. When I founded the first droidcon developer conference in Berlin in 2009, we used cloud services for storing data about attendees, finances, and other topics without thinking about it in much detail. Five years and many events in a dozen countries later, I changed our tools and went back to “old-fashioned” local storage solutions. Not because I’m aware of any concrete misuse, but out of a heightened sensitivity to what could be done with it through governmental or criminal attacks on our IT infrastructure. Especially after growing into areas like Eastern Europe, the Middle East, and parts of Asia, I feel more than ever responsible for all the sensitive data our community gives us. Not to mention the first NY droidcon this year on “NSA ground.” I’m not concerned about the things I know could be done with our data; my “creepy thoughts” spin around the kinds of (mis)use I’m not even thinking about.

  • leelive

    We need to create a world where benefits derived from our personal data come via trusted agents acting in our best interest. Easy to say, hard to accomplish. Until then, the only thing we can be sure of is that our “service provider’s” benefit takes priority, and they’re not likely to open their books.

  • Glen Turpin

    Gibu Thomas: “I think we all know where the creep factor comes in, intuitively.”

    There are far too many creepy business and government data collection practices for that to be a true statement. Or, if it is true, the creep factor is often ignored in service of organizational goals.

  • Master_Gill_Bates

    Evgeny Morozov is right; you are a naive ass. What you call “tinfoil hat stuff” is what Progressive Insurance calls its “Snapshot” program. It was formerly called MyRate and has been around since 2008. Is this typical of the amount of research and analysis you do before you pontificate on complex issues with potentially devastating consequences?

    Secondly, the reason people who are smarter than you want to prohibit the collection of personal data is that it is fundamentally impossible to “prohibit its misuse once companies have that data.” As the example of Lilly Ledbetter shows, it’s difficult to catch even the most blatant forms of discrimination. How exactly do you think you will know if you aren’t getting “equal access to the best prices on an online site”?

    The problem with letting simplistic people spout off about policy is that other people will be hurt by it. Your writing “Whoops! My bad…” when the consequences come home to roost will not fix the problems your shortsighted nonsense creates.

  • marion47

    I believe there are various factors that drive racial discrimination