Truly Open Data

I’m kicking myself. I have spent a non-trivial number of hours talking to government departments and scientists about open data, talking up an “open source approach” to data, pushing hard to get them to release datasets in machine readable formats with reuse-friendly licenses. I’ve had more successes than failures, met and helped some wonderful people, and now have more mail about open data in my inbox than about open source. So why am I kicking myself?

I’m kicking myself because I’ve been taking far too narrow an interpretation of “an open source approach”. I’ve been focused on getting people to release data. That’s the data analogue of tossing code over the wall, and we know it takes more than a tarball on an FTP server to get the benefits of open source. The same is true of data.

Open source discourages laziness (because everyone can see the corners you’ve cut), it can get bugs fixed or at least identified much faster (many eyes), it promotes collaboration, and it’s a great training ground for skills development. I see no reason why open data shouldn’t bring the same opportunities to data projects.


And a lot of data projects need these things. From talking to government folks and scientists, it’s become obvious that serious problems exist in some datasets. Sometimes corners were cut in gathering the data, or there’s a poor chain of provenance for the data so it’s impossible to figure out what’s trustworthy and what’s not. Sometimes the dataset is delivered as a tarball, then immediately forks as all the users add their new records to their own copy and don’t share the additions. Sometimes the dataset is delivered as a tarball but nobody has provided a way for users to collaborate even if they want to.

So lately I’ve been asking myself: What if we applied the best thinking and practices from open source to open data? What if we ran an open data project like an open source project? What would this look like?

First, we’d collaboratively build the dataset. This means we’d have a curator who is the equivalent of a project leader, taking patches and filtering for quality. Successful open source project leaders foster a group of developers of different skills, rewarding on merit while bringing on new talent. As with open source projects, the nirvana state is a project that can survive the retirement or death of its founder.

But collaboration takes more than leadership–open source projects have tools that help. An open data project would need a mailing list to collaborate on, IRC or equivalent to chat in real-time, and a bug-tracker to identify what needs work and ensure that the users’ needs are being met. The official dataset of New Zealand school zones has errors but there’s nobody to report them to, much less a way to submit a fix to a maintainer. Oh, and don’t forget a way to acknowledge and credit contributors—think not just of credits.txt but also of the difference between patch submitter, committer, and project maintainer.

Open source software developers have a powerful set of tools to make distributed authoring of software possible: diff to identify what’s changed, patch to apply those changes elsewhere, version control to track changes over time and show provenance. Patch management would be just as important in a collaborative open data project, where users and other researchers might be submitting new or revised data. What would git for data look like? Heck, what would a local branch look like? I have a new attribute, you have a different projection, she has new rows, how does this all tie back together? (I eagerly await claims that RDF will solve this problem and all others)
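As a thought experiment, here is a minimal sketch of what diff and patch might mean for the simplest possible dataset: a set of records, each identified by a key. Everything here is an assumption made for illustration (the function names, the patch layout, the made-up zone records); it ignores the harder cases raised above, such as new attributes or a different projection.

```python
def diff_records(old, new):
    """Describe how to turn `old` into `new`.

    Both arguments are dicts mapping a record key to a dict of attributes.
    """
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": sorted(old.keys() - new.keys()),
        "changed": {k: new[k] for k in old.keys() & new.keys()
                    if old[k] != new[k]},
    }


def apply_patch(base, patch):
    """Return a copy of `base` with `patch` applied; `base` is not modified."""
    result = {k: dict(v) for k, v in base.items()}
    for key in patch["removed"]:
        result.pop(key, None)
    for section in ("added", "changed"):
        for key, record in patch[section].items():
            result[key] = dict(record)
    return result


# Hypothetical records, keyed by a made-up zone id.
published = {
    "z1": {"school": "North Primary", "streets": 12},
    "z2": {"school": "South Primary", "streets": 9},
}
corrected = {
    "z1": {"school": "North Primary", "streets": 13},  # a user's fix
    "z3": {"school": "East Primary", "streets": 4},    # a user's new row
}

patch = diff_records(published, corrected)
rebuilt = apply_patch(published, patch)
```

The patch is small enough to mail to a maintainer, which is the point: a contributor sends the change rather than a whole new copy of the dataset. What happens when two contributors change the same record is exactly the merge question that this sketch leaves open.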

That’s just development. The interface between developers and users is the release. State of the art for a lot of government data is the equivalent of source.tar.gz. No version numbers, much less the ability to download older versions of the datasets or to separate stable and development branches.

Why would we want to download the historic version of a dataset? Because a paper used it and we want to test the analysis software that the paper used to ensure we get the same answer. Or because we want to see what our analysis technique would have shown with the knowledge that was available back then. Or simply to be able to track defects.

The users of data will have to adapt to the idea of versions, like the users of software have. The maintainers of the dataset might release five different versions of it while you’re writing your analysis code, so it can’t be a painful process to incorporate the revised data into your project. With software we have shared libraries and dynamic libraries, supported by autotools and such packages. Our code has interfaces and a branch that promises backwards compatibility. What would that look like for data? And what is the data version of the dependency hell that software developers know all too well (M 1.5 depends on N 1.7 and P 2.0, but P 2.0 requires N 2.0, and upgrading N to 2.0 breaks M which expects the 1.x set of interfaces from N …)?
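To make the M, N, and P example concrete, here is a rough sketch of how a tool might spot that kind of conflict if dataset releases declared what they were built against. The manifest format and the function name are invented for illustration; no existing packaging system for data is implied.

```python
# Hypothetical manifest: each release lists the major version of every
# other dataset it depends on.
requirements = {
    "M 1.5": {"N": 1, "P": 2},
    "P 2.0": {"N": 2},
}


def find_conflicts(requirements):
    """Return the dependencies that are required at more than one major version.

    The result maps dependency name -> {major version: [releases needing it]}.
    """
    needed = {}
    for release, deps in requirements.items():
        for name, major in deps.items():
            needed.setdefault(name, {}).setdefault(major, []).append(release)
    return {name: by_major for name, by_major in needed.items()
            if len(by_major) > 1}


conflicts = find_conflicts(requirements)
```

Here `conflicts` reports N, because M 1.5 wants the 1.x series while P 2.0 wants 2.x. Detecting the clash is the easy part; deciding which release has to move is still a human conversation.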

And, of course, there’s documentation. As with software, I imagine we’ll see some docs structured and some unstructured. The state of the art isn’t great for government datasets, it has to be said: if you’re lucky you get a “code X means ABCD” but rarely are you told exactly how the data were generated, the limits on their accuracy, situations where they shouldn’t be used, etc.

Finally, we need to change attitudes and social systems. Data is produced as the product of work done, and is rarely conceived of as having a life outside the original work that produced it. Some datasets will have such a life (some won’t: think of how many projects fail to interest anyone but the person who started them). This means thinking of yourself not just as the person who does the work, but as the person who leads a project of interested outsiders and (in some cases) collaborators, and who is building something that will last beyond your own time. This is not a natural mindset within government nor, in many cases, science. Funding and budgeting systems at the moment may prevent this, and would need to change.

The good news is that while government datasets are rarely generated collaboratively, science is a little further along. PubMed and GenBank are just two examples of great science collaborations that we can learn from, and I’m sure there are more. Beyond science, OpenStreetMap is an important example of collaborative data gathering and the Open Knowledge Foundation folks may have work in this area already. I’m keen to learn more about the open data projects that are more than just data-over-the-wall and share what I find. Time to stop kicking myself and start learning!
