Open Data Pointers

When I blogged about truly open data, readers sent me a lot of interesting links. I’ve collected them all below. Enjoy!

  1. The Centre for Environmental Data Archival (CEDA) — hosts a range of activities associated with environmental data archives. (Director is on Twitter, @bnlawrence)
  2. CONNECT — open source healthcare data exchange being developed with Brian Behlendorf, one of the original developers of the Apache web server.
  3. Phil Agre’s Living Data — prescient article in Wired from 1994.
  4. Factual — web database that permits multiple values in tables, and you can apply different functions to choose which values you’ll use when you work with the data (e.g., “most recent”, “most popular”, …).
  5. HDF — BSD-licensed toolkit and format for storing and organising large numeric datasets.
  6. NetCDF — software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
  7. Madagascar — an open-source software package for multidimensional data analysis and reproducible computational experiments with digital image and data processing in geophysics and related fields.
  8. Data Documentation Initiative — standard for metadata of social science datasets.
  9. The Memento Project — project to incorporate versioned web pages into regular browsing.
  10. Data Sharing: A Look at the Issues — presentation from Science Commons data manager Kaitlin Thaney.
  11. SIFN Datasprint — these folks are planning a sprint around data, the same way coders often have sprints around code.
  12. Get Your Database Under Version Control — 2008 piece by Jeff Atwood on the need to version control your database.
  13. CKAN — Comprehensive Knowledge Archive Network. The database of open datasets is itself an open dataset, managed by a versioned database.
  14. Componentization and Open Data, Collaborative Development of Data: OKFN are figuring out packaging and structure of distributed data development. They seem closest to building what I was talking about.
  15. Open Data Maturity Model (Slideshare) – I like the idea of progressing from amateur to professional and identifying milestones along the continuum, but I’m not convinced that the last two stages are based on existing projects. I’m a big fan of building frameworks from successful projects, rather than building the framework in isolation.
  16. Data RSS — proposal for an API for data feeds.
  17. Fedora Commons — open source software for managing, indexing, and delivering datasets. Islandora integrates it into Drupal.
  18. gEDA — GPL’d suite of Electronic Design Automation tools, some of which are applicable to non-electronics data projects.
  19. You Cannot Run an Open Data Project Like an Open Source Project, Unless… — a not-always-coherent disagreement with my post. His most lucid moment comes in pointing out that government datasets have a single owner. This is the difference between intrinsic data (crimes reported to police), where the data is about the operations of an agency, and extrinsic data (wild horse populations in Arizona, tree ring climate records), where an agency sends people into the field or otherwise collects and possibly processes data from others’ labour. But even intrinsic data can be maintained more collaboratively: take bug reports and corrections from users (there is no 800 block of Main, did you mean the 80 block of Main?). It’s true that I can’t imagine a lot of collaboration around the preparation and distribution of pure sensor data (e.g., traffic data), but my post talked about more than collaborative generation: revision management, documentation, etc.
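
Factual’s multiple-values-per-cell idea (entry 4) can be sketched in a few lines: each cell stores candidate values with metadata, and a chooser function decides which one to use at query time. Everything below (field names, data, function names) is a hypothetical illustration, not Factual’s actual API:

```python
from datetime import date

# Hypothetical cell: several candidate values, each carrying metadata
# that a chooser function can rank on.
cell = [
    {"value": "555-0100", "added": date(2009, 1, 5), "votes": 12},
    {"value": "555-0199", "added": date(2010, 3, 2), "votes": 4},
]

def most_recent(candidates):
    # Prefer the value with the latest "added" date.
    return max(candidates, key=lambda c: c["added"])["value"]

def most_popular(candidates):
    # Prefer the value with the most user votes.
    return max(candidates, key=lambda c: c["votes"])["value"]

print(most_recent(cell))   # → 555-0199
print(most_popular(cell))  # → 555-0100
```

The same table can thus answer differently depending on which resolution policy the consumer chooses.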
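
The approach Atwood argues for in entry 12 is commonly realized as numbered migration scripts checked into source control, plus a version table inside the database itself. A minimal sketch with SQLite (the schema and table names are invented for illustration):

```python
import sqlite3

# Every schema change is an ordered, numbered migration kept under
# version control; the database records which ones it has applied.
MIGRATIONS = [
    (1, "CREATE TABLE cities (name TEXT, population INTEGER)"),
    (2, "ALTER TABLE cities ADD COLUMN country TEXT"),
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0  # NULL on a fresh database -> start at 0
    for version, sql in MIGRATIONS:
        if version > current:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)  # applies both migrations
migrate(conn)  # idempotent: nothing left to apply
```

Because the applied-version state lives in the database, any copy can be brought up to date by replaying only the migrations it is missing.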


  • Bob Simons

    Please also mention OPeNDAP’s DAP protocol, which specifies how to request a subset of a large dataset and defines the over-the-wire format for the response. The simplicity, flexibility, and efficiency of the protocol have led to its wide usage in the oceanographic and other scientific communities. There are many DAP-compatible servers (Hyrax, THREDDS, PyDAP, ERDDAP, etc.) and many clients (JDAP, the NetCDF libraries, Matlab, etc.).
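
For readers unfamiliar with DAP: the subset request is encoded directly in the URL as a constraint expression, one [start:stride:stop] hyperslab per dimension of the variable. A rough sketch of building such a URL (the server address and variable name are hypothetical):

```python
# Sketch of a DAP-style subset URL. A constraint expression asks the
# server for a hyperslab of one variable instead of the whole file.
def dap_subset_url(base, variable, slices):
    # slices: one (start, stride, stop) tuple per dimension
    constraint = variable + "".join(
        "[%d:%d:%d]" % (start, stride, stop) for start, stride, stop in slices
    )
    return "%s.ascii?%s" % (base, constraint)  # .ascii asks for text output

url = dap_subset_url(
    "http://example.org/opendap/sst.nc",       # hypothetical server
    "sst",
    [(0, 1, 0), (10, 1, 20), (100, 1, 150)],   # time, lat, lon ranges
)
print(url)
# http://example.org/opendap/sst.nc.ascii?sst[0:1:0][10:1:20][100:1:150]
```

The server returns only the requested hyperslab, which is what makes DAP practical for multi-gigabyte archives.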

  • John S. Erickson, Ph.D.

    Thanks for this useful list, Nat!

    In his recent talk “Overcoming Systemic Resistance to Generativity in Science” at Harvard’s Berkman Center, Science Commons VP John Wilbanks said during the Q&A that we are currently missing the data equivalent of SourceForge.

    Wilbanks’ point was that while there are many domain-specific silos (you’ve listed some!), we don’t yet have the equivalent of a SourceForge or Google Code for datasets. He wasn’t simply referring to a common, widely accessible platform for uploading and accessing datasets; his point was more about a common platform that would support communities forming around and supporting particular datasets, just as with source code.

    I’m wondering what your thoughts are on this?

  • Steve

    Hi Nat,

    Thanks for the very useful post:)

    While ‘versions’ are important in data sets, one shouldn’t lose sight of the underlying principle that a database (typically within an SQL-oriented DBMS) is a ‘state machine’, and that the current state of a production database is ‘the best set of data currently available’. Making ‘that’ data publicly available, particularly via the web, and particularly via the web’s main language, HTML, is a very important part of an Open Data movement.

    In this vein of reasoning, I would add to your list of open source technologies the ‘SQL+PaWS’ tool for making data in relational DBMSs widely available.

    It accepts standard SQL statements via a simple HTML form interface to underlying SQL servers, returning the requested data as standard HTML tables with appropriate metadata (for client software purposes).

    An implementor of such an open data set (using SQL+PaWS) simply needs to add three things to the provided JSP server-side code in order to make their whole database open as an HTML web service, namely: a database name, a user name, and a password
    …the rest is automatic.

    SQL+PaWS Home Page:

    A Blog entry on it, at release is here:

    There are soooo many databases held in relational DBMSs just begging to be made available to the general global public, with technical complexity (and hence cost) being the most significant impediment.

    [Of course the concept of ‘versions’ in datasets is what Data Warehousing is all about.]
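
The flow Steve describes (SQL in, HTML tables out) can be sketched in a few lines; this is an illustrative stand-in using SQLite, not SQL+PaWS’s actual JSP code, and the table and data are invented:

```python
import sqlite3
from html import escape

# Run a SQL query and render the result set as a plain HTML table
# that any web client can consume.
def sql_to_html(conn, query):
    cur = conn.execute(query)
    header = "".join("<th>%s</th>" % escape(d[0]) for d in cur.description)
    rows = "".join(
        "<tr>%s</tr>" % "".join("<td>%s</td>" % escape(str(v)) for v in row)
        for row in cur.fetchall()
    )
    return "<table><tr>%s</tr>%s</table>" % (header, rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE horses (region TEXT, population INTEGER)")
conn.execute("INSERT INTO horses VALUES ('Arizona', 284)")
page = sql_to_html(conn, "SELECT region, population FROM horses")
```

Escaping every cell value is the one non-obvious step: without it, data containing `<` or `&` would corrupt the generated markup.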


  • Joseph Kelly

    Please also include, which is an open data repository. Users can post data under any license they want and we will host it for free. Our collection has several thousand datasets on a variety of different topics.