We need open models, not just open data

Writing my post about AI and summoning the demon led me to re-read a number of articles on Cathy O’Neil’s excellent mathbabe blog. I highlighted a point Cathy has made consistently: if you’re not careful, modeling has a nasty way of enshrining prejudice with a veneer of “science” and “math.”

Cathy has consistently made another point that’s a corollary of her argument about enshrining prejudice. At O’Reilly, we talk a lot about open data. But it’s not just the data that has to be open: it’s also the models. (There are too many must-read articles on Cathy’s blog to link to; you’ll have to find the rest on your own.)

You can have all the crime data you want, all the real estate data you want, all the student performance data you want, all the medical data you want, but if you don’t know what models are being used to generate results, you don’t have much. You’re going to be showing black people homes in predominantly black neighborhoods not because you want to keep white neighborhoods pure, but because that’s where the model says they’re most likely to buy. You’re going to be stopping and searching more minority drivers without cause not because you’re prejudiced, but because the model says they’re more likely to be arrested for crimes. And if you stop more minority drivers, you almost certainly will arrest more minority drivers, so the model becomes self-fulfilling.
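The self-fulfilling loop is easy to see in a toy simulation. What follows is only a sketch under loudly stated assumptions: the group names, the starting arrest counts, the offense rate, and the rule "allocate stops in proportion to past arrests" are all invented for illustration, and none of it describes any real police model. Both groups offend at exactly the same rate; the only difference is a skew in the historical arrest record.

```python
def simulate_feedback(initial_arrests, offense_rate, stops_per_round, rounds):
    """Allocate stops by share of past arrests; return cumulative arrests per group.

    Hypothetical toy model: every group has the same true offense rate, and
    arrests are tracked as expected values rather than random draws.
    """
    arrests = dict(initial_arrests)
    for _ in range(rounds):
        total = sum(arrests.values())
        # The "model": a group's share of stops equals its share of past arrests.
        stops = {group: stops_per_round * count / total
                 for group, count in arrests.items()}
        # Same offense rate for everyone, so new arrests depend only on stops.
        for group in arrests:
            arrests[group] += stops[group] * offense_rate
    return arrests


# Assumed starting point: a skewed historical record, not real data.
history = {"group_a": 60.0, "group_b": 40.0}
result = simulate_feedback(history, offense_rate=0.05,
                           stops_per_round=1000, rounds=20)
share_a = result["group_a"] / sum(result.values())
print(f"group_a share of arrests after 20 rounds: {share_a:.2f}")
```

Even in this deliberately mild version, the arrest record never drifts back toward an even split, although the two groups behave identically. Every new round of arrests appears to confirm the allocation that produced it, and without access to the allocation rule itself nobody looking only at the arrest data could tell.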

Intentions mean nothing when they’re hidden behind a model that makes decisions for you. A recent study of police profiling in my state, Connecticut, showed not only that blacks were more likely to be stopped than whites, but also that when they were stopped and searched, whites were significantly more likely to have something illegal in their cars. How would we build a model from this data, and what would it show? How would we know what the model is doing, if it’s never examined? Would the column with surprising data be dropped because it leads to unexpected and politically unacceptable results? Would it be weighted less than a column on, say, past arrests? If the model isn’t open, how would you ever know?

As we become more dependent on modeling, more and more of our world becomes inscrutable. Without the models, you will never understand the way financial markets are manipulated. Without the models, you will never understand how school teachers are evaluated. You may never know why the real estate agent showed you certain houses, or why you’re paying so much for insurance. Is that OK? It all seems nice and scientific.

Open data enables the democratization of data. It’s important to be able to do your own analysis of public data sets. But if you really want to understand the effect data is having on law enforcement, on insurance, on education, or on the economy, you need the models. Cathy has documented being stonewalled on requests for the models, which are almost always viewed as proprietary. That’s a problem, particularly when the modelers (not the poets) become the “unacknowledged legislators of the world” (Shelley, A Defence of Poetry).

Open models: the time has come.

Cropped image on article and category pages by Sonny Abesamis on Flickr, used under a Creative Commons license.
