Big data is our generation’s civil rights issue, and we don’t know it

What the data is must be linked to how it can be used.

Data doesn’t invade people’s lives. Lack of control over how it’s used does.

What’s really driving so-called big data isn’t the volume of information. It turns out big data doesn’t have to be all that big. Rather, it’s about a reconsideration of the fundamental economics of analyzing data.

For decades, there’s been a fundamental tension between three attributes of databases. You can have the data fast; you can have it big; or you can have it varied. The catch is, you can’t have all three at once.

The big data trifecta

I first heard this described as the “three V’s of data”: Volume, Variety, and Velocity. Traditionally, getting two was easy, but getting all three was very, very, very expensive.

The advent of clouds, platforms like Hadoop, and the inexorable march of Moore’s Law means that now, analyzing data is trivially inexpensive. And when things become so cheap that they’re practically free, big changes happen — just look at the advent of steam power, or the copying of digital music, or the rise of home printing. Abundance replaces scarcity, and we invent new business models.

In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. A traditional enterprise data warehouse might have tracked sales of widgets by color, region, and size. This act of deciding what to store and how to store it is called designing the schema, and in many ways, it’s the moment where someone decides what the data is about. It’s the instant of context.

That needs repeating:

You decide what data is about the moment you define its schema.

With the new, data-is-abundant model, we collect first and ask questions later. The schema comes after the collection. Indeed, big data success stories like Splunk, Palantir, and others are prized because of their ability to make sense of content well after it’s been collected — sometimes called a schema-less query. This means we collect information long before we decide what it’s for.
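The difference between the two models can be sketched in a few lines of Python. The field names and the sample event here are hypothetical, chosen to echo the widget-sales warehouse above:

```python
# Schema-first (data-is-scarce): decide what the data is about up front.
# Anything outside the schema is discarded at collection time.
SCHEMA = {"color", "region", "size"}  # hypothetical widget-sales fields

def collect_schema_first(event):
    """Keep only the fields the schema anticipated; context is fixed now."""
    return {k: v for k, v in event.items() if k in SCHEMA}

# Schema-later (data-is-abundant): store everything raw, and impose
# meaning only at query time.
raw_store = []

def collect_raw(event):
    raw_store.append(dict(event))

def query_later(field):
    """Ask a question nobody anticipated when the data was collected."""
    return [e[field] for e in raw_store if field in e]

sale = {"color": "red", "region": "EU", "size": "L", "referrer": "ad-42"}
print(collect_schema_first(sale))  # "referrer" is lost forever
collect_raw(sale)
print(query_later("referrer"))     # still available to ask about later
```

The second store can answer questions nobody had thought to ask at collection time, which is exactly the power, and the danger, the article describes.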

And this is a dangerous thing.

When bank managers tried to restrict loans to residents of certain areas (a practice known as redlining), Congress stepped in to stop it with the Fair Housing Act of 1968, legislating against discrimination by making it illegal to change loan policy based on someone’s race.

Home Owners’ Loan Corporation map showing redlining of “hazardous” districts in 1936.

“Personalization” is another word for discrimination. We’re not discriminating if we tailor things to you based on what we know about you — right? That’s just better service.

In one case, American Express used purchase history to adjust a customer’s credit limit based on where he shopped, despite his excellent credit history:

Johnson says his jaw dropped when he read one of the reasons American Express gave for lowering his credit limit: “Other customers who have used their card at establishments where you recently shopped have a poor repayment history with American Express.”

Some of the things white men liked in 2010, according to OKCupid.

We’re seeing the start of this slippery slope everywhere, from tailored credit-card limits like this one to car insurance based on driver profiles. In this regard, big data is a civil rights issue, but it’s one that society in general is ill-equipped to deal with.

We’re great at using taste to predict things about people. OKCupid’s 2010 blog post “The Real Stuff White People Like” showed just how easily we can use information to guess at race. It’s a real eye-opener (and the guys who wrote it didn’t include everything they learned — some of it was a bit too controversial). They simply looked at the words one group used that others rarely did. The result was a list of “trigger” words for a particular race or gender.
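The underlying technique is simple enough to sketch: count word frequencies per group, then rank words by how disproportionately one group uses them relative to everyone else. This is a minimal illustration with made-up phrases and a plain frequency ratio, not OKCupid’s actual method:

```python
from collections import Counter

def distinctive_words(group_texts, other_texts, min_count=2):
    """Rank words one group uses far more often than everyone else."""
    group = Counter(w for t in group_texts for w in t.lower().split())
    other = Counter(w for t in other_texts for w in t.lower().split())
    g_total = sum(group.values()) or 1
    o_total = sum(other.values()) or 1
    scores = {}
    for word, n in group.items():
        if n < min_count:
            continue
        # Ratio of relative frequencies; +1 smooths words the other
        # group never uses at all.
        scores[word] = (n / g_total) / ((other[word] + 1) / o_total)
    return sorted(scores, key=scores.get, reverse=True)

a = ["I love camping and craft beer", "camping trips and beer"]
b = ["I love dancing and travel", "travel and dancing all night"]
print(distinctive_words(a, b)[:2])  # e.g. ['camping', 'beer']
```

Run backwards, the same list becomes a crude classifier: spotting those words in someone’s writing is weak but real evidence about group membership, which is the point the next paragraph makes.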

Now run this backwards. If I know you like these things, or see you mention them in blog posts, on Facebook, or in tweets, then there’s a good chance I know your gender and your race, and maybe even your religion and your sexual orientation. And that I can personalize my marketing efforts towards you.

That makes it a civil rights issue.

If I collect information on the music you listen to, you might assume I will use that data to suggest new songs, or share it with your friends. But instead, I could use it to guess at your racial background. And then I could use that data to deny you a loan.

Want another example? Check out Private Data In Public Ways, something I wrote a few months ago after seeing a talk at Big Data London, which discusses how publicly available last name information can be used to generate racial boundary maps:

Screen from the Mapping London project.

This TED talk by Malte Spitz does a great job of explaining the challenges of tracking citizens today, and he speculates about whether the Berlin Wall would ever have come down if the Stasi had access to phone records in the way today’s governments do.

So how do we regulate the way data is used?

The only way to deal with this properly is to somehow link what the data is with how it can be used. I might, for example, say that my musical tastes should be used for song recommendation, but not for banking decisions.
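One way to picture linking what the data is with how it can be used is a record that carries its own permitted purposes, so every query must declare why it is asking. This is a hypothetical sketch, not a real system; all the names here are invented:

```python
class UsageViolation(Exception):
    """Raised when data is read for a purpose its owner never allowed."""
    pass

class TaggedDatum:
    """A value bundled with the purposes its owner has permitted."""
    def __init__(self, value, allowed_purposes):
        self._value = value
        self._allowed = frozenset(allowed_purposes)

    def read(self, purpose):
        # Every access must declare its purpose up front.
        if purpose not in self._allowed:
            raise UsageViolation(f"'{purpose}' not permitted for this datum")
        return self._value

music_taste = TaggedDatum(["jazz", "hip-hop"],
                          allowed_purposes={"song_recommendation"})

print(music_taste.read("song_recommendation"))  # allowed use

try:
    music_taste.read("credit_scoring")          # the loan-decision case
except UsageViolation as e:
    print("blocked:", e)
```

Of course, a wrapper like this only binds well-behaved software, which is why the next paragraph turns to the two harder enforcement mechanisms, encryption and legislation.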

Tying data to permissions can be done through encryption, which is slow, riddled with DRM, burdensome, hard to implement, and bad for innovation. Or it can be done through legislation, which has about as much chance of success as regulating spam: it feels great, but it’s damned hard to enforce.

There are brilliant examples of how a quantified society can improve the way we live, love, work, and play. Big data helps detect disease outbreaks, improve how students learn, reveal political partisanship, and save hundreds of millions of dollars for commuters — to pick just four examples. These are benefits we simply can’t ignore as we try to survive on a planet bursting with people and shaken by climate and energy crises.

But governments need to balance their reliance on data with checks on how that reliance erodes privacy and creates civil and moral issues we haven’t thought through. It’s something most of the electorate isn’t thinking about, and yet it affects every purchase they make.

This should be fun.

This post originally appeared on Solve for Interesting. This version has been lightly edited.

  • Marc Robinson-Rechavi

    Great post, which links to concerns about the use of genetic data. With a biological sample of a person, I can get the genome sequence. And even if you allowed me to take it (with a certain use in mind), I can now use it for many other uses. Moreover, many uses of a genome sequence will only become apparent in the (near) future, but your genome sequence might be hanging around for a long time.

    • Ellis84

      U.S. laws would prevent this.

      • Passing a law does not prevent data from being misused.

      • As I understand it, GINA (the Genetic Information Nondiscrimination Act) would protect Californians specifically around healthcare. Personally, with the EPA and other agencies being torn down, I’m not sure I’d put my faith in legislation entirely.

        The broader issue is that in a capitalist, free-market world, you aren’t being harmed or prejudiced against; you’re being given the opportunity for premium pricing or preferential treatment. This tragedy-of-the-commons problem is endemic to public/private blended services like healthcare (as a Canadian, I see this all too well).

  • Charles

    “Tying data to permissions can be done through encryption, which is slow, riddled with DRM, burdensome, hard to implement, and bad for innovation.”

    With all the down sides of abusing data you mentioned, these qualities of using encryption may be a positive. Whether it is “bad for innovation” is subjective. I suspect it wouldn’t be “bad” so much as slow things down, and perhaps that is exactly what we need in this case: time to think.

    • I was actually referring to the compute time—if I have to run a query that involves lookups on an external system, it’ll be slow. As the late Jim Gray pointed out, compared to the cost of bandwidth, everything else is free.

      But slowness in terms of reflection and taking the time to understand things has to be weighed against the need to fix serious, pervasive problems with the world around us. For example, fraud detection isn’t easy in Europe, in part because of privacy legislation.

  • I think you raise good points, although for the example of Londoners’ last names… it seems like, for the US at least, this is peanuts compared to data already collected and widely published by the Census Bureau.


  • Great exploration of a topic we need more discussion about. I see this as a critical topic for those of us active in the space to keep in the conversation. Some additional thoughts here:


  • This is an excellent piece that inspired me to think about what other than regulation would help to make Big Data less risky. Here’s my answer…process and oversight:

  • loriaustex

    Semi-pro historian that I am, I decided to take a different look at this, by looking for similar cases in the past. I don’t think it’s feasible to control or manage this (based on my take on history) but I do think questions about the destabilizing impact of Big Data Analytics are worth thinking about.



  • Consumers are increasingly accepting their loss of privacy as well as nefarious uses of big data. At the same time governments are reluctant to restrict commerce for fear of increasing unemployment. This is a dangerous combination that will likely contribute to more abuse of data.


  • Robert Ellis Smith

    You’re forgetting about “data pollution.” The more data an organization accumulates, the harder it is to find the data it needs.

    The message of your piece is that the more what we search, browse, and collect online DIFFERS from our neighbors, from our true interests, and from the dominant culture, the more we are protected from inquisitive tracking.

    Robert Ellis Smith, Publisher, Privacy Journal

  • We need to connect the idea of big data to the idea of root data structure in civilized Society. It’s about the construct of an Individual life amidst a global population of Individuals. None of these computing capabilities were engineered to optimize the experience of Human freedom or personal authority. And yet they came from people who were doing just that. The Individual impetus must now be joined with an advanced model of the relationship that is core to a well-formed civil-Society.

    Individuals require sovereign source authority in order for big data to expose its true value through utility. As long as we lack that, we lack everything being well-formed, and problems ensue. Administrative authority is at the core of your root data structure as a Human being.

    If you do not know what that means… that is a problem. Your problem.

  • atimoshenko

    The privacy advantage of querying databases is that such a query itself leaves a trace. In other words, you cannot spy on me without at least someone/something being aware of your actions. The privacy and civil rights issue can then be solved by notifying the person being investigated of the investigation. For instance, I don’t have to know why my bank was looking at my music history; I just have to be notified each time it does so. Then I can call them up and ask them to explain themselves, and switch banks if that explanation is unsatisfactory.

    Greater information availability is not a problem. Information asymmetry between the ‘watchers’ and the ‘being watched’ is.

    • Micah Saltazar

      Such queries leave no trace unless all the software that surrounds the database is built to keep such records.

      In many instances, the software that provides the front-end (i.e., the end-user interface) does keep track of queries. But the extraction systems that populate the “big data” warehouses pull data directly from the database, bypassing the logging routines incorporated into the front-end, and leave no trace whatsoever.




  • stormkrow

    Thank God for Couchbase.
    And, you’re welcome. ;-)



  • Jay

    Spintronics for the drm, etc. encryption middleware with holographic storage and upgradeable holosynchronization protocols as the optics correlates get better.