Rethinking Open Data

Lessons learned from the Open Data front lines

In the last year I’ve been involved in two open data projects, one of them Open New Zealand. I believe in learning from experience, and I’ve seen signs recently that other projects might benefit from mine, so this post is a recap of what I’ve learned. It’s the byproduct of a summer spent reflecting on my last nine months working in open data.

Technologists like to focus on technology, and I’m as guilty of that as the next person. When Open New Zealand started, we rushed straight to the “catalogue”. I was part of a smart group of top-notch web hackers–we know what a catalogue is, it’s a web-based database, so let’s figure out the UI flow and which fields we want, and hey, I can hack one up in WordPress and I’ll work on the hosting, and so on. We spent more time worrying about CSS than we did worrying about the users.

This is the exact analogue of an open source software failure mode: companies often think they can get all the benefits of open source simply by releasing their source code. Think of it like a dinner party: the best dinner parties are about the other people. Similarly, the best open source projects have great people, attract great people, and the source is simply what they’re working on: necessary but not sufficient. You can build it, but they won’t come. All successful open source projects build communities of supportive, engaged developers who identify with the project and keep it productive and useful.

Data catalogues around the world have launched and then realised that they now have to build a community of data users. There’s value locked up in government data, but you only realise that value when the datasets are used. Once you finish the catalogue, you have to market it so that people know it exists. Not just random Internet developers, but everyone who can unlock that value. This category, “people who can use open data in their jobs”, includes researchers, startups, established businesses, other government departments, and (yes) random Internet hackers, but the category doesn’t have a name and it doesn’t have a Facebook group, newsletter, AGM, or any other way for you to reach them easily.

This matters because it costs money to make existing data open. That sounds like an excuse, and it’s often used as one, but underneath is a very real problem: existing procedures and datasets aren’t created, managed, or distributed in an open fashion. This means that the data’s probably incomplete, the documentation’s not great, the systems it lives on are built for internal use only, and there’s no formal process around managing and distributing updates. It costs money and time to figure out the new processes, build or buy the new systems, and train the staff.

In particular, government and science are often funded as projects. When the project ends, the funding stops. For almost all the datasets we have today, ongoing maintenance and distribution were never budgeted for. This attitude has to change, and new projects give us the chance to get it right, but most existing datasets are unfunded for maintenance and release.

So while opening all data might be The Right Thing To Do from a philosophical perspective, it’s going to cost money. Governments would rather identify the high-value datasets, where great public policy comment, intra-government optimisation, citizen information, or commercial value can be unlocked. Even if you don’t buy into the cost argument, there’s definitely an order problem: which datasets should we open first? It should be the ones that will give society the greatest benefit soonest. But without a community of users to poll, a well-known place for would-be data consumers to come to and demand access to the data they need, the policy-making parts of governments are largely blind to what data they have and what people want.

That’s not to say that data catalogues aren’t useful. We were scratching an itch–we wanted easier access to government data, so we built the tool that would provide it. The community of data users can be built around the tool. As Krishna told Arjuna, “a man must go forth from where he stands. He cannot jump to the Absolute, he must evolve toward it”. I’m just noting that, as with all creative endeavours, we learned about the problem by starting to fix it.

Which brings me to the second big lesson: which problem are we trying to solve? There’s an Open Data movement emerging around governments releasing data. However, there are at least five different types of Open Data groupie: low-polling governments who want to see a PR win from opening their data, transparency advocates who want a more efficient and honest government, citizen advocates who want services and information to make their lives better, open advocates who believe that governments act for the people therefore government data should be available for free to the people, and wonks who are hoping that releasing datasets of public toilets will deliver the same economic benefits to the country as did opening the TIGER geo/census dataset.

The one thing these groups don’t share is an outcome. I can imagine an honest government where the costs of transparency outweigh the costs of corruption (think of the cost of removing every last dirt particle from your house). I can imagine PR wins that don’t come from delivering real benefits to citizens; in fact, I see this in a recent tweet by Sunlight Labs’s Ellen Miller:

Most of the raw data released by the OGD most likely isn’t for you to use.

She’s grumbling, as does this Washington Post piece, about the results so far from the Open Government Directive, which has prompted datasets of questionable value to be added to the catalogue. If this is the future, where’s my flying car? If this is open data, where’s my damn transparency?

There are some promising signs. The UK government data catalogue had a long beta period where developers were working with the data. The UK team built a community as well as a catalogue. That’s not to say that the UK effort is all gold–I saw plenty of frustration with RDF while I was observing the developers–but it stands out simply for the acknowledgement of users. Similarly, the UK’s MySociety defined what success means to them: they’re all about building useful apps for citizens, and open data is a means to that end, not an end in itself.

So, after nearly a year in the Open Data trenches, I have some advice for those starting or involved in open data projects. First, figure out what you want the world to look like and why. It might be a lack of corruption, it might be a better society for citizens, it might be economic gain. Whatever your goal, you’ll be better able to decide what to work on and learn from your experiences if you know what you’re trying to accomplish. Second, build your project around users. In my time working with politicians and civil servants, I’ve realised that success breeds success: the best way to convince them to open data is to show them an open data project that’s useful to real people. Not a catalogue or similar tool aimed at insiders, but something that’s making citizens, voters, constituents happy. Then they’ll get it.

My next project with Open New Zealand is to build a community of data users. I want to see users supporting each other, to build a tight feedback loop between those who want data and those who can provide it, and to make it easier to assess the value created by government-released open data. Henry Kissinger said, “each success only buys admission to a more difficult problem”. I look forward to learning what the next problem is.

  • ADSLGeek

    Really nice post mate!

    I am going to have to have a bit of a play with the NZ govt data and see what I can come up with!

  • Egon Willighagen

    … but the category doesn’t have a name and it doesn’t have a Facebook group, newsletter, AGM, or any other way for you to reach them easily.

    Chemistry has, since 1995, seen the Blue Obelisk movement establish itself as a contact point for Open Source, Open Data, and Open Specifications in chemistry. It was announced in a cheminformatics journal, and should perhaps have tried to get into a leading organic chemistry journal… but we have been there, talking a lot with people about Open X.

    Agreed, we do not have a Facebook group :) But here are some public pointers:


  • W. W. Munroe

    I had an interesting experience with government when our team successfully made all the public statistical data available via the internet with table and graphing outputs and downloads in 2005.

    Since the data was made easily accessible and the numbers produced by the province could be easily compared with the Federal numbers, the differences were readily apparent.

    For example, the provincial population numbers showed an increase in young adults moving into the province including retirement areas, while the Federal numbers showed a decline. The provincial population statistics unit assumed that because the newly elected governing party would improve the economy, young adults were moving into all areas of the province, attracted to new job opportunities. Unfortunately, the opposite was true and more young adults were leaving as could be seen in the Federal numbers.

    The project was shut down, and an effort was made to cover up the many non-statistical and sub-standard methods and models used to create population numbers.

    The numbers produced by the province are used to justify opening and closing education and health facilities, for official community plans, to determine Election boundaries, for electrical generation projects etc.

    Why make data available and verifiable? Basing plans on unreliable numbers can increase costs and waste time unnecessarily.

  • Antti Poikola

    Great article, thanks for sharing it with the world!

    In Finland we speak about the Open Data Ecosystem, the idea that the open data world is not polarised into data producers and reusers. The government mostly produces the data in order to use it, and the community of “data reusers” may actually patch the datasets and even produce new data. The best thing is that the data flows freely and the members of the ecosystem get to know each other.

  • Rosalyn Metz

    I think government does have a place to go to poll who is using what datasets: it’s called the library.

    There are millions upon millions of people every day that go into their libraries and ask questions like: “what is the GDP for those nations that have the highest infant mortality rate?”

    If governments took a look at libraries as a tool, rather than cutting library budgets, they might know this.

  • Paul Boos

    I personally loved the article; thanks for writing it.

    A curious question I have for any folks out there to answer…

    Should governments publish data (exceptions: classified or sensitive data such as privacy data) under an open source-type license (say, Creative Commons)? And as a follow-on, should it be copyleft, in that all uses of it should be made publicly available as well? (Since the public’s money paid for it.)

  • Jon Udell

    Outstanding, Nat. Exactly right and beautifully said.

  • Nat Torkington

    @Egon: good point, there may be groups for specific types of data. It’s a mixture of making contact with existing communities and building a new one for everyone who doesn’t already have such a community.

    @Antti: yes, I have another post brewing about collaborative data projects. We haven’t even begun to see what happens when open source methods are applied to data!

    @Rosalyn: good point, but there’s no formal feedback mechanism through libraries. The librarians will be part of the larger community of data but, for now anyway, the agents for openness within the government need more immediate and direct contact with their users.

    @Paul: Yes, they should publish data. They should use as permissive a license as possible, and one custom designed for data (don’t reuse source code or content licenses, as data has its own needs).

    I bet there’s a flamewar to be had about reciprocal licenses (GPL-like “you must now distribute your data extensions under the same license as you received the original data”). I think it depends on the dataset: for some it won’t matter which license you go with. For others where the goal is collaboration, a GPL-like license removes a lot of the reward from forking. For some where economic benefits are the reason for release, BSD-like seems appropriate (I note projects like PostgreSQL where a BSD license hasn’t prevented cooperation).

  • ian

    great post nat. data is a lot like the iceberg metaphor. no metadata = big-time time sink to understand; ETL on its own is a business; stale data = diminished value or, even worse, misinterpretation leading to poor decision-making.

  • Virginia Carlson

    There may not be a Facebook group, yet, but there exists a network of local community organizations who have been trying to bring data to bear on local community decision-making in the U.S.–some partners for decades. The National Neighborhood Indicators Partnership has been sourcing and re-purposing government data since long before the internet was born. Let’s get gov 2.0 “hackers” to the next NNIP conference:

  • Mike Mathieu

    At Front Seat, about 3 years ago we filled a white board with ideas of civic data sets that we thought might be useful, and pondered what kinds of mashups we could make out of it. Then we realized that creativity was only unleashed when we had specific problems to solve, rather than potential opportunities to exploit. That thinking went into Walk Score and newer projects like City-Go-Round, which advocates for open transit data, and gives a geo search engine for transportation apps.

  • Andrew Krzmarzick

    Really excellent post, Nat. One of my favorite lines:

    “Governments would rather identify the high-value datasets, where great public policy comment, intra-government optimisation, citizen information, or commercial value can be unlocked. Even if you don’t buy into the cost argument, there’s definitely an order problem: which datasets should we open first?”

    We are answering a similar question in a dialogue over on GovLoop:

    Also, this whole notion of “which data first?” was one of the key themes that emerged from CityCamp last weekend:

    Your post here should be required reading for all government folks who are beginning to unveil their datasets. I plan to share it far and wide.

    – Andy
    GovLoop Community Manager

  • David Sonnen


    Great start! This is the kind of critical thinking we need to make sense of “open” in data.

    Your point about people having different outcomes for data is key. To make the point, it might be useful to compare open data and open source

    Data represents something — buyer’s behavior, places on the ground, money spent, soil types, what ever. People’s interest in what ever the data represents varies wildly.

    In contrast, open source code does something — draws a map, calculates the odds, dials a phone, whatever. People’s interest in a piece of code is in what the code does.

    Code has a built-in outcome. Data doesn’t.

    There are a few other characteristics of data that we need to consider. 1) Data can be a thing or a process; 2) Data can be a byproduct of a process or a product for syndication; 3) Data’s value depends on what it is used for; and 4) Data costs real money to produce.

    Finally, open data can, in concept, be quite valuable. But, for that data to be produced continuously, there has to be an economic system that sustains the producers.

    Lots to think about, Nat. Keep up the critical thinking.

  • Mark Essel

    Your driving priority of building a community of folks interested in the data sounds much like the lean startup movement. As a founder, I seek to optimize the value unlocked by my team’s efforts. I think you’re tackling the open data problem in a similar way. I think I’m all of the above in your categories of open web supporters, minus the political PR play. Yeah, I’m a wonk who expects unexpected linked data to reveal enormous social data.

    Sincerely wish you the best of luck. Would love to connect as a supporter with bankrupt free time.

  • Bruce Bannerman

    Excellent post Nat.

    I addressed similar concerns in the OSGeo-AustNZ submission to the Australian Victorian Government’s Parliamentary ‘Inquiry into Improving Access to Victorian Public Sector Information and Data’ [1] prepared by Cameron Shorter and myself.

    I apologise for referring to my own work, however this is too important an issue for niceties.

    Most of us would like access to ‘Authoritative’ spatial data sources that are accurate (positionally as well as by aspatial content); timely; well maintained; and free of charge.

    As you point out it is an expensive undertaking to capture, maintain and serve out such datasets. Even Internet bandwidth does not come free.

    I believe that it is possible to achieve the utopian model above.

    Open communities such as Open Streetmap are showing a ‘possible’ way forward.

    If we were to combine the best of communities such as Open Streetmap; Open Source community development models; with government and industry, best of breed datasets then we could be on our way.

    Our ‘cost of access’ to the data may well be our need to help maintain it.

    For more thoughts on this see [1] pp 16, 25, 30-31.



  • Nat Torkington

    @Virginia: thanks for the pointer to the NNIP group. Let’s also get some NNIP folks to the Gov 2.0 events!

    @Mike: I love that line, “creativity was only unleashed when we had specific problems to solve, rather than potential opportunities to exploit.”. Thanks!

    @Andrew: thanks for the pointer to the GovLoop discussion. I suspect this conversation is beginning to happen in many different places.

    @David: I really like your observation that code has a built-in outcome, whereas data doesn’t. When you say, “for that data to be produced continuously, there has to be an economic system that sustains the producers”, I mentally substitute “code” for “data” and see where the thought takes me …

    @Mark: I’m definitely a fan of the lean startup movement. I’m trying to iterate and learn from everything I do, whether in a startup or not. Hence the reflective period to figure out what I’d learn and how I’d change what I’m doing. If you ever find another mortgage broker selling sub-prime time loans, let me know :)

    @Bruce: thanks for the kind words. I love Open Street Map, they’re doing great work not only in gathering data but also thinking about licensing. They’re the ones who first really nailed for me why CC and GPL licenses aren’t appropriate, and I have huge respect for their honest and diligent efforts to find or make a better license that reflects the goals, ideals, and practices of that community. I’m starting to wonder what an OSI for data would be (not to mention an FSF!).

  • David Sonnen


    One more thought: The economic system that sustains producers has to work when the producers and users are the same folks. That’s an interesting and important part of open source and open data.


  • Jonathan Gray


    “I’m starting to wonder what an OSI for data would be (not to mention an FSF!).”

    Have you seen the Open Knowledge Foundation, and its Open Knowledge Definition (like the F/OSS definitions but for content/data…):


  • JHW

    Note: repeat of comments posted elsewhere; I meant to post them here.

    You may want to survey some of the “lessons learned” publications and research from the digital library world regarding the problem of, “if you build it, they won’t come”, which libraries and archives learned the hard way from the Institutional Repository movement. This movement was supposed to unlock taxpayer funded research for access/use by taxpayers. Well, they built them, have to support and maintain these repositories, and the users didn’t come. D-Lib Magazine is a good starting point:

  • Matt Johnston

    Our local open data group is “code4pizza”, the community that links to the local government open data repo, “OpenDataNI”.

    I think this is the best model. The group provides opportunity (by chasing out new and compelling datasources) and the individuals provide innovation (and hopefully commerce, revenue).

    Now – we have debates in the group – but they’re about whether to use git or svn, Google Maps or OpenStreetMap.

  • Puneet Kishor

    This is a brilliant post Nat. Thanks for sharing your insight. I do want to add a caveat to this — I hope this post is not taken as ammo for *not* working toward open data. Nat’s insights were achieved *after* going through the exercise of opening data. That the results were not necessarily as exhilarating as expected is not an argument for closing up — I am pretty certain that Nat didn’t expect his essay to be used as such, but I want to lay this out in the open for anyone itching to do so.

  • David Swann

    Great contribution Nat – thank you.

    As I work to help SDI get off the ground in NZ, a number of your points resonate!

    On one level it is interesting watching government struggle to figure out how to ‘do’ the web. Government demands rigid programs of work based on formal statements of requirement defined as objective metrics. That’s a nice safe way of doing vertical things – creating a system that does one thing really well. It’s not such a good way of doing horizontal stuff – like creating spatial data infrastructures. And you rightly make the point about ongoing data maintenance.

    On another level too many mandarins still associate power with budget; budget with system; and system with data. Try to imagine the perspective of someone who equates data with power: why indeed would you want to free up that power? Imagine ‘Yes Minister 2.0’ to think about the huge organizational challenges we face in implementing SDI.

    When technology was the barrier it provided the perfect obfuscation. Today, in the geospatial domain, all the key SDI standards are nicely implemented in almost all proprietary or open source technologies. For many organizations, in technology terms it really is a matter of switching the capability on. But someone has to want to flick that switch; and then we’re into who actually wants to take responsibility for something that delivers little direct glory.

    You’re right about government tending to want to identify ‘high-value’ data. That inventorying instinct misses a key point about who defines ‘value’. One of the glorious things about the Internet is the sheer amount of crap out there. That’s because what’s useless to me might be incredibly valuable to someone else… and vice versa. And it’s up to me as the user to determine that usefulness – and part of that measure is the reputation of the provider. I think of SDI as a spatial extension to the Internet and want to revel in the richness… I don’t want a sterile wasteland of ‘useful’ spatial data.

    You’ve given me a lot to think about! Thanks.


  • Nat Torkington

    @David: You’re absolutely right about the inventorying instinct. It’s the old “we know best” attitude in another form, a form of data paternalism or at best hubris. But as I said, if it’s a battle to open each piece of data but it gets easier with “successes”, then we should be trying to liberate the more obviously useful data first. I hate to break it to the Dept of Statistics, but that’s not national milk powder production figures.

    @Puneet — yes, point absolutely taken. I spoke for an hour to the Economist’s journalist, but in the end my contribution was largely reduced to “Torkington admitted failure”. This was frustrating, because I think we achieved a great success in opening data–it’s just that the first win told us where the next would come from.

  • Puneet Kishor

    Nat wrote —

    @Puneet — yes, point absolutely taken. I spoke for an hour to the Economist’s journalist, but in the end my contribution was largely reduced to “Torkington admitted failure”. This was frustrating, because I think we achieved a great success in opening data–it’s just that the first win told us where the next would come from.

    Reminds me of the journalist who saw Bill Gates’ dog swimming. Next day’s headlines scream, “Bill Gates’ dog can’t walk on water!”

  • Jose Leal

    Thanks for the post Nat!

    I was wondering if you thought if the project we’re working on might help to address some of the issues you have so clearly identified.

    Our project is Open Marketplace, and it’s in its infancy, but we are quickly moving to establish an organization that will develop the Open Marketplace platform. We see the platform as the means to solving a number of marketplace problems. One of those problems is this emerging issue of how to produce, distribute, manage, and update Open Data.

    It would be great if you found the time to take a look at OM and let us know what you think.

    Again, thanks for the great post either way.

    Jose Leal

  • Martin Borman

    In order to benefit fully from the potential of Open Data, governments need to think of Open Data as a process. Indeed, it is a means to an end.

    In our day-to-day consultancy we use the Open Data Maturity Model to show governments where they now stand and where they need to grow.

  • Julian Tait

    Open Data is all about the people; without them there would be no purpose. It should be seen as part of a wider move to a freer and more equitable society, although whether this will be achieved is debatable. I am leading a project in Manchester, UK called Open Data Cities, which looks at the release of local data that will enable people to make meaningful choices in how they navigate and use the city in which they live and work. I agree that the only way you can convince many civil servants is by actually showing them use cases. Most are still wedded to the notion that they are the owners of the data rather than its custodians, and are so risk-averse that the idea of opening up data for scrutiny makes them run for cover or hoist the barricades.

  • Steve Ardire

    Nice post Nat !

    Yes indeed, just publishing open datasets is Step 1, which begs the question: what exactly are the most meaningful things users want to do, especially for information-intensive, harder-to-solve problems like policy formation and change?

    For this, Step 2 should be to aggregate and map these open datasets into useful open semantic frameworks.

    Why? One good reason is that this adds meaning to the relationships between pieces of information, making them more readily understandable and actionable.

    Check out