Rethinking Open Data

In the last year I’ve been involved in two open data projects, Open New Zealand and data.govt.nz. I believe in learning from experience and I’ve seen some signs recently that other projects might benefit from my experience, so this post is a recap of what I’ve learned. It’s the byproduct of a summer reflection on my last nine months working in open data.

Technologists like to focus on technology, and I’m as guilty of that as the next person. When Open New Zealand started, we rushed straight to the “catalogue”. I was part of a smart group of top-notch web hackers–we know what a catalogue is, it’s a web-based database and let’s figure out the UI flow and which fields do we want and hey I can hack one up in WordPress and I’ll work on the hosting and so on. We spent more time worrying about CSS than we did worrying about the users.

This is the exact analogue of an open source software failure mode: often companies think they can get all the benefits of open source simply by releasing their source code. The best dinner parties are about the other people. Similarly, the best open source projects have great people, attract great people, and the source is simply what they’re working on: necessary but not sufficient. You can build it but they won’t come. All successful open source projects build communities of supportive engaged developers who identify with the project and keep it productive and useful.

Data catalogues around the world have launched and then realised that they now have to build a community of data users. There’s value locked up in government data, but you only realise that value when the datasets are used. Once you finish the catalogue, you have to market it so that people know it exists. Not just random Internet developers, but everyone who can unlock that value. This category, “people who can use open data in their jobs”, includes researchers, startups, established businesses, other government departments, and (yes) random Internet hackers, but the category doesn’t have a name and it doesn’t have a Facebook group, newsletter, AGM, or any other way for you to reach them easily.

This matters because it costs money to make existing data open. That sounds like an excuse, and it’s often used as one, but underneath is a very real problem: existing procedures and datasets aren’t created, managed, or distributed in an open fashion. This means that the data’s probably incomplete, the documentation’s not great, the systems it lives on are built for internal use only, and there’s no formal process around managing and distributing updates. It costs money and time to figure out the new processes, build or buy the new systems, and train the staff.

In particular, government and science are often funded as projects. When the project ends, the funding stops. For almost all the datasets we have today, ongoing maintenance and distribution of the data haven’t been budgeted for. This attitude has to change, and new projects give us the chance to get it right, but most existing datasets are unfunded for maintenance and release.

So while opening all data might be The Right Thing To Do from a philosophical perspective, it’s going to cost money. Governments would rather identify the high-value datasets, where great public policy comment, intra-government optimisation, citizen information, or commercial value can be unlocked. Even if you don’t buy into the cost argument, there’s definitely an ordering problem: which datasets should we open first? It should be the ones that will give society the greatest benefit soonest. But without a community of users to poll, or a well-known place for would-be data consumers to come to and demand access to the data they need, the policy-making parts of governments are largely blind to what data they have and what people want.

That’s not to say that data catalogues aren’t useful. We were scratching an itch: we wanted easier access to government data, so we built the tool that would provide it. The community of data users can be built around the tool. As Arjuna was told by Krishna, “a man must go forth from where he stands. He cannot jump to the Absolute, he must evolve toward it”. I’m just noting that, as with all creative endeavours, we learned about the problem by starting to fix it.

Which brings me to the second big lesson: which problem are we trying to solve? There’s an Open Data movement emerging around governments releasing data. However, there are at least five different types of Open Data groupie: low-polling governments who want to see a PR win from opening their data, transparency advocates who want a more efficient and honest government, citizen advocates who want services and information to make their lives better, open advocates who believe that governments act for the people therefore government data should be available for free to the people, and wonks who are hoping that releasing datasets of public toilets will deliver the same economic benefits to the country as did opening the TIGER geo/census dataset.

The one thing these groups don’t share is an outcome. I can imagine an honest government where the costs of transparency outweigh the costs of corruption (think of the cost of removing every dirt particle from your house). I can imagine PR wins that don’t come from delivering real benefits to citizens. In fact, I see this in a recent tweet by Sunlight Labs’s Ellen Miller:

Most of the raw data released by the OGD most likely isn’t for you to use.

She’s grumbling, as does this Washington Post piece, about the results so far from the Open Government Directive, which has prompted datasets of questionable value to be added to data.gov. If this is the future, where’s my flying car? If this is open data, where’s my damn transparency?

There are some promising signs. The UK government data catalogue had a long beta period where developers were working with the data. The UK team built a community as well as a catalogue. That’s not to say that the UK effort is all gold–I saw plenty of frustration with RDF while I was observing the developers–but it stands out simply for the acknowledgement of users. Similarly, the UK’s MySociety defined what success is to them: they’re all about building useful apps for citizens, and open data is a means not an end to them.

So, after nearly a year in the Open Data trenches, I have some advice for those starting or involved in open data projects. First, figure out what you want the world to look like and why. It might be a lack of corruption, it might be a better society for citizens, it might be economic gain. Whatever your goal, you’ll be better able to decide what to work on and learn from your experiences if you know what you’re trying to accomplish. Second, build your project around users. In my time working with the politicians and civil servants, I’ve realised that success breeds success: the best way to convince them to open data is to show an open data project that’s useful to real people. Not a catalogue or similar tool aimed at insiders, but something that’s making citizens, voters, constituents happy. Then they’ll get it.

My next project with Open New Zealand is to build a community of data users. I want to create an environment where data users support each other, to build a tight feedback loop between those who want data and those who can provide it, and to make it easier to assess the value created by government-released open data. Henry Kissinger said, “each success only buys admission to a more difficult problem”. I look forward to learning what the next problem is.
