Perspectives on Open Data Workshop

I was part of a panel, Perspectives on Open Data Workshop, at Webstock last month, along with Fiona Romeo, Adrian Holovaty, David Recordon, and Toby Segaran. The New Zealand State Services Commission, for whom the workshop was held, have released the recordings and summarised notes, and they're well worth reading because my fellow panelists emitted scintillating photons of clue with almost every question. Remember, as you read them, that we're talking about "Government-held non-personal data": that is, not birthdates and fetishes.

Many thanks to the State Services Commission and Webstock for hosting the panel. We had a great time, and I think all the panelists walked away with a deeper understanding of the different dimensions to the challenge. What follows is a mixture of my preparation notes and the observations of my fellow panelists that really caught my ear. First read the summary prepared by the SSC, then you’ll have the context for what follows.

I began with a tip of the hat to the Open Data Principles from the Open Government Data project. They specify that:

Government data shall be considered open if they are made public in a way that complies with the principles below:

  1. Complete
    All public data are made available. Public data are data that are not subject to valid privacy, security or privilege limitations.
  2. Primary
    Data are collected at the source, with the finest possible level of granularity, not in aggregate or modified forms.
  3. Timely
    Data are made available as quickly as necessary to preserve the value of the data.
  4. Accessible
    Data are available to the widest range of users for the widest range of purposes.
  5. Machine processable
    Data are reasonably structured to allow automated processing.
  6. Non-discriminatory
    Data are available to anyone, with no requirement of registration.
  7. Non-proprietary
    Data are available in a format over which no entity has exclusive control.
  8. License-free
    Data are not subject to any copyright, patent, trademark or trade secret regulation. Reasonable privacy, security and privilege restrictions may be allowed.

Compliance must be reviewable.

That seemed pretty comprehensive to me.

Someone mentioned the Google Australian bushfire mashup, which the Victorian government refused to help with. There was information coming from the private sector, but none from the state government.

Adrian used a lovely phrase, “works of journalism”, which put journalism up with art and literature.

I realized part way through that an effective open data policy must change the default setting on data from private to public. That is, data should be public unless there’s a reason to make it private. At the moment it’s private unless there’s a reason to make it public. The reason this must change is cost: the huge amount of marginal data (of use to a handful of people) will never escape the hurdle of justifying the benefit of treating it as a one-off special-case exemption to the rules, whereas if the default is public then the marginal data will be available without the transaction cost of justifying its public existence.

I also realized that Government is the provider of first and last resort. As provider of first resort, the government has to get the data out there in a form other projects can use. It can be ugly at first, but it has to be out the door. Then other people can clean it up, build visualizations, offer pre-digested MySQL dumps, etc. But government must also be the provider of last resort: one can't assume those pre-digested dumps will be available forever, so the government must keep offering the data for download itself, no matter how permanent the third-party copies may seem. It's an odd role: to pitch the data out there and to be a backstop for when the data consumers swing and miss.

Crown Copyright came up as a huge problem. New Zealand, like most Commonwealth countries, doesn’t have the USA’s wonderful policy that information produced by the government is in the public domain. Instead there’s Crown copyright, a special kind of term-limited copyright that exists for works the government creates. This includes legislation and pretty much every piece of data you want. This creates yet another obstacle to data reuse with no benefit.

Adrian: "Out there" is better than not out there; we developers will do the work.

David: Build schemas with your customers.

Toby: name reconciliation is hard. (By this he meant figuring out that two different document types are talking about the same location, person, address, book, …)

How do you get replication from feeds? In other words, I can announce that a job is available, but what do I do if the position is later withdrawn? Adrian asks government sources for delta feeds: one for additions, one for changes, one for deletions.
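To make that concrete, here's a minimal sketch of a consumer keeping a local mirror in sync from three delta feeds. The endpoints, field names, and feed layout are my invention for illustration, not anything Adrian specified:

```python
import json
import urllib.request

# Hypothetical delta-feed endpoints: one each for additions, changes, deletions.
BASE = "https://data.example.govt.nz/jobs"
FEEDS = {
    "added": f"{BASE}/added.json",
    "changed": f"{BASE}/changed.json",
    "deleted": f"{BASE}/deleted.json",
}

def fetch(url):
    """Fetch a feed and parse it as a JSON list of records."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def sync(mirror):
    """Apply one round of deltas to a local mirror keyed by record id."""
    for record in fetch(FEEDS["added"]):
        mirror[record["id"]] = record
    for record in fetch(FEEDS["changed"]):
        mirror[record["id"]] = record      # replace the stale copy
    for record in fetch(FEEDS["deleted"]):
        mirror.pop(record["id"], None)     # withdrawn positions disappear

mirror = {}
sync(mirror)
```

The point of the three-feed split is that a consumer never has to re-download the whole dataset or guess which records vanished; each poll is a cheap, mechanical merge.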

The panel was asked for their preferred formats. Toby: JSON; Adrian liked any XML with a strongly-defined schema.
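As a rough illustration of why those two answers are compatible (the record itself is made up, not from the panel): both formats give a program an unambiguous structure to walk, which is the "machine processable" principle in action:

```python
import json
import xml.etree.ElementTree as ET

# The same (invented) record in the two formats the panel favoured.
as_json = '{"id": 42, "title": "Policy Analyst", "city": "Wellington"}'
as_xml = '<job id="42"><title>Policy Analyst</title><city>Wellington</city></job>'

job = json.loads(as_json)          # one call, native types
print(job["title"])

root = ET.fromstring(as_xml)       # equally mechanical, once the schema is known
print(root.findtext("title"))
```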

Toby wasn’t hot on RDFa, but said RDF is important for namespaces.

Toby: Best ontologies are built after the fact.

Adrian: Public data is a good idea, but it’s useless unless you explain the codes. (e.g., that a 401 is “threatened with knife” and 581 is “menacing with vegetable”)
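A sketch of what "explaining the codes" buys a consumer, reusing the joke codes above: if the lookup table is published alongside the data, decoding becomes a mechanical join instead of a phone call to the agency.

```python
# A published code table turns opaque integers into meaning.
# (The codes and descriptions are the joke examples from the panel.)
INCIDENT_CODES = {
    401: "threatened with knife",
    581: "menacing with vegetable",
}

incidents = [{"id": 1, "code": 401}, {"id": 2, "code": 581}]

for incident in incidents:
    label = INCIDENT_CODES.get(incident["code"], "unknown code")
    print(incident["id"], label)
```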

Lots of places have "gray" data: it's used day to day, but there's a lot of tacit knowledge around it, and the people who work with it "nudge" it each time it's used. What to do? The answer is basically to turn it into an open source project: a mailing list, a contact email address, code, etc.

Adrian: keep in mind the 3am factor–make it as easy as possible for someone to build something with your data at 3am (e.g., no human-approved registration process).

Adrian identified dcstat as “best US work in open data at the city level.”

Fiona pointed to the British Show Us A Better Way. The UK government wanted to learn what people wanted to find and which datasets they wanted opened. So they asked.

There were questions about the value of running contests. They’re a gimmick, and it’s not a surefire thing that you’ll get an audience for boring data simply by running a contest.

There were questions about the “explicit costs” of opening data. I likened it to writing software: the tools and processes for building code that will be seen by other people give you higher-quality code than is often produced by closed processes where bad practices aren’t exposed to the disinfectant of sunlight. In other words, you’re more efficient if your processes are for publishing than if they’re purely for private consumption.

Fiona gave some suggestions for how to charge. Don't penalize popular applications: UK meteorological data is free if you're building something small and local, but if you want to roll it out to the whole country then you pay.