Open Data Pointers

When I blogged about truly open data, readers sent me a lot of interesting links. I’ve collected them all below. Enjoy!

The Centre for Environmental Data Archival (CEDA) — hosts a range of activities associated with evironmental data archives. (Director is on Twitter, @bnlawrence)
CONNECT — open source healthcare data exchange being developed with Brian Behlendorf, one of the original developers of the Apache web server.
Phil Agre’s Living Data — prescient article in Wired from 1994.
Factual — web database that permits multiple values in tables, and you can apply different functions to choose which values you’ll use when you work with the data (e.g., “most recent”, “most popular”, …).
HDF — BSD-licensed toolkit and format for storing and organising large numeric datasets.
NetCDF — software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
Madagascar — an open-source software package for multidimensional data analysis and reproducible computational experiments with digital image and data processing in geophysics and related fields.
Data Documentation Initiative — standard for metadata of social science datasets.</li
The Memento Project — project to incorporate versioned web pages into regular browsing.
Data Sharing: A Look at the Issues — presentation from Science Commons data manager Kaitlin Thaney.
SIFN Datasprint — these folks are planning a sprint around data, the same way coders often have sprints around code.
Get Your Database Under Version Control — 2008 piece by Jeff Atwood on the need to version control your database.
CKAN — Comprehensive Knowledge Archive Network. The database of open datasets is itself an open dataset, managed by a versioned database.
Componentization and Open Data, Collaborative Development of Data: OKFN are figuring out packaging and structure of distributed data development. They seem closest to building what I was talking about.
Open Data Maturity Model (Slideshare) – I like the idea of progressing from amateur to professional and identifying milestones along the continuum, but I’m not convinced that the last two stages are based on existing projects. I’m a big fan on building frameworks from successful projects, rather than building the framework in isolation.
Data RSS — proposal for an API for data feeds.
Fedora Commons — open source for managing, indexing, and delivering datasets. Islandora integrates that into Drupal.
gEDA — GPL’d suite of Electronic Design Automation tools, some of which are applicable to non-electronics data projects.
You Cannot Run an Open Data Project Like an Open Source Project, Unless… — not always coherent disagreement with my post. His most lucid moment comes pointing out that government datasets have a single owner. This is the difference between intrinsic data (crimes reported to police) where the data is about the operations of an agency, and extrinsic data (wild horse populations in Arizona, tree ring climate records) where an agency sends people into the field or otherwise collects and possibly processes data from others’ labour. But even intrinsic data can be more collaboratively maintained: take bug reports and corrections from users (there is no 800 block of Main, did you mean the 80 block of Main?). It’s true: I can’t imagine a lot of collaboration around the preparation and distribution of pure sensor data (e.g., traffic data), but my post talked about more than collaborative generation: revision management, documentation, etc.

Open Data Pointers

Get the O’Reilly Data Newsletter