Truly Open Data

I’m kicking myself. I have spent a non-trivial number of hours talking to government departments and scientists about open data, talking up an “open source approach” to data, pushing hard to get them to release datasets in machine readable formats with reuse-friendly licenses. I’ve had more successes than failures, met and helped some wonderful people, and now have more mail about open data in my inbox than about open source. So why am I kicking myself?

I’m kicking myself because I’ve been taking far too narrow an interpretation of “an open source approach”. I’ve been focused on getting people to release data. That’s the data analogue of tossing code over the wall, and we know it takes more than a tarball on an FTP server to get the benefits of open source. The same is true of data.

Open source discourages laziness (because everyone can see the corners you’ve cut), it can get bugs fixed or at least identified much faster (many eyes), it promotes collaboration, and it’s a great training ground for skills development. I see no reason why open data shouldn’t bring the same opportunities to data projects.


And a lot of data projects need these things. From talking to government folks and scientists, it’s become obvious that serious problems exist in some datasets. Sometimes corners were cut in gathering the data, or there’s a poor chain of provenance for the data so it’s impossible to figure out what’s trustworthy and what’s not. Sometimes the dataset is delivered as a tarball, then immediately forks as all the users add their new records to their own copy and don’t share the additions. Sometimes the dataset is delivered as a tarball but nobody has provided a way for users to collaborate even if they want to.

So lately I’ve been asking myself: What if we applied the best thinking and practices from open source to open data? What if we ran an open data project like an open source project? What would this look like?

First, we’d collaboratively build the dataset. This means we’d have a curator, the equivalent of a project leader, taking patches and filtering for quality. Successful open source project leaders foster a group of developers with different skills, rewarding on merit while developing new talent. As with open source projects, the nirvana state is a project that can survive the retirement or death of its founder.

But collaboration takes more than leadership–open source projects have tools that help. An open data project would need a mailing list to collaborate on, IRC or equivalent to chat in real-time, and a bug-tracker to identify what needs work and ensure that the users’ needs are being met. The official dataset of New Zealand school zones has errors but there’s nobody to report them to, much less a way to submit a fix to a maintainer. Oh, and don’t forget a way to acknowledge and credit contributors—think not just of credits.txt but also of the difference between patch submitter, committer, and project maintainer.

Open source software developers have a powerful set of tools to make distributed authoring of software possible: diff to identify what’s changed, patch to apply those changes elsewhere, version control to track changes over time and show provenance. Patch management would be just as important in a collaborative open data project, where users and other researchers might be submitting new or revised data. What would git for data look like? Heck, what would a local branch look like? I have a new attribute, you have a different projection, she has new rows, how does this all tie back together? (I eagerly await claims that RDF will solve this problem and all others)
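To make the gap concrete, here is a minimal sketch of what diff and patch might look like for keyed tabular data. The records are invented, and a real "git for data" would also need merges, branches, and provenance tracking:

```python
# A toy diff/patch for keyed tabular data: each record has a primary key,
# and a "patch" records which keys were added, removed, or changed.

def diff_records(old, new):
    """Compute a patch between two {key: record} dicts."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = set(old.keys() - new.keys())
    changed = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"added": added, "removed": removed, "changed": changed}

def apply_patch(old, patch):
    """Apply a patch to a dataset, returning an updated copy."""
    result = {k: v for k, v in old.items() if k not in patch["removed"]}
    result.update(patch["added"])
    result.update(patch["changed"])
    return result

v1 = {1: {"school": "Aotea College", "zone": "Porirua"},
      2: {"school": "Onslow College", "zone": "Johnsonville"}}
v2 = {1: {"school": "Aotea College", "zone": "Porirua East"},
      3: {"school": "Tawa College", "zone": "Tawa"}}

patch = diff_records(v1, v2)
assert apply_patch(v1, patch) == v2
```

The hard cases raised above (a new attribute, a different projection, conflicting edits to the same record) are exactly what this naive keyed diff cannot express, which is why the question is worth asking.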

That’s just development. The interface between developers and users is the release. The state of the art for a lot of government data is the equivalent of source.tar.gz: no version numbers, much less the ability to download older versions of a dataset or to choose between stable and development branches.

Why would we want to download the historic version of a dataset? Because a paper used it and we want to test the analysis software that the paper used to ensure we get the same answer. Or because we want to see what our analysis technique would have shown with the knowledge that was available back then. Or simply to be able to track defects.

The users of data will have to adapt to the idea of versions, as the users of software have. The maintainers of a dataset might release five different versions of it while you’re writing your analysis code, so it can’t be a painful process to incorporate the revised data into your project. With software we have shared and dynamic libraries, supported by autotools and similar packages. Our code has interfaces and a branch that promises backwards compatibility. What would that look like for data? And what is the data version of the dependency hell that software developers know all too well? (M 1.5 depends on N 1.7 and P 2.0, but P 2.0 requires N 2.0, and upgrading N to 2.0 breaks M, which expects the 1.x set of interfaces from N …)
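One modest piece of an answer is the data analogue of a lockfile: the analysis records the exact version and content hash of each dataset release it ran against. A hedged sketch, in which the dataset name, version, and bytes are all invented:

```python
# Pin a dataset release by version string and content hash, so an analysis
# stays reproducible even after the maintainers ship five new versions.

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def load_dataset(releases, name, version, expected_hash):
    """Fetch a pinned release and refuse to run against the wrong bytes."""
    data = releases[(name, version)]
    actual = checksum(data)
    if actual != expected_hash:
        raise ValueError(f"{name} {version}: hash mismatch, got {actual}")
    return data

# Stand-in for a release archive keyed by (dataset, version).
releases = {("school-zones", "2.1.0"): b"id,school,zone\n1,Aotea College,Porirua\n"}

pinned = checksum(releases[("school-zones", "2.1.0")])  # what the lockfile records
data = load_dataset(releases, "school-zones", "2.1.0", pinned)
```

Re-running the analysis later, or against a forked copy of the data, then fails loudly instead of silently producing different numbers.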

And, of course, there’s documentation. As with software, I imagine we’ll see some docs structured and some unstructured. The state of the art isn’t great for government datasets, it has to be said: if you’re lucky you get a “code X means ABCD”, but rarely are you told exactly how the data were generated, the limits on their accuracy, situations where they shouldn’t be used, etc.

Finally, we need to change attitudes and social systems. Data is produced as the product of work done, and is rarely conceived of as having a life outside the original work that produced it. Some datasets will find that wider life (some won’t: think of how many projects fail to interest anyone but the person who started them). This means thinking of yourself not just as the person who does the work, but as the person who leads a project of interested outsiders and (in some cases) collaborators, and who is building something that will last beyond their time. This is not a natural mindset within government nor, in many cases, science. Current funding and budgeting systems may prevent it, and would need to change.

The good news is that while government datasets are rarely generated collaboratively, science is a little further along. PubMed and GenBank are just two examples of great science collaborations that we can learn from, and I’m sure there are more. Beyond science, OpenStreetMap is an important example of collaborative data gathering and the Open Knowledge Foundation folks may have work in this area already. I’m keen to learn more about the open data projects that are more than just data-over-the-wall and share what I find. Time to stop kicking myself and start learning!

  • Will

    Good article. I agree the next big wave is open data. Version control, accuracy and data cleansing are key. Hopefully the privacy issues won’t stop governments from making lots of valuable core data open to enable the next generation of services and applications.

  • Mohammed

    This is a good point. Some problems will arise as well, such as how to handle checking the integrity of the data in an incremental manner – what would be the equivalent of unit tests, for example? How do you make sure additions conform with the rest of the data, if they should at all? This provides another way of looking at the data, one that requires a different mindset than the one required during collection, which will improve its validity and produce quality conversations about the data.

  • iricelino

    Nat, thanks a lot for sharing. I totally agree with you: the open data and linked data communities should learn a lot from open source lessons.

    I believe that provenance info and metadata/documentation/quality info about data sources are key to a wider adoption of open data patterns. And I’m confident that the linked data community is working in this direction (yes, here is the one who claims that RDF can help in solving a lot of problems… :-P)

  • Ken Williams

    Since a lot of data is essentially tabular in form, a nice start would be a tool that could patch a data set with changes to rows *or* columns efficiently. Right now with text-oriented tools like diff & patch (& storage formats like RCS reverse-delta) we can get efficient management of changes to rows but not columns.

    Is Tridge listening?

  • Antti Poikola

    Once again Nat you hit the point!

    I read about the idea of “patching” data in the same way as source code for the first time here: http://datacommoners.blogspot.com/2009/04/big-data.html

    Are there any practical tools for that? What is the state of the art? I’m quite confident that it’s not hard for the gov people to understand the need for evolving from “data-over-the-wall”, but as far as I know, the tools and good practices are missing.

    -Jogi from Finland

  • Mark Essel

    Excellent post Nat, it got me thinking about data history tracking and how important it is to open data utility.

    We certainly need a “git” for open data tracking.

  • Jessy Cowan-Sharp

    Thanks for a great post. You should specifically check out the datapkg project from OKFN: “datapkg is an user tool for distributing, discovering and installing data (and content) ‘packages’.”

    It’s a nascent effort but seems like a great start.

    One point that really hit home in your article is about the provenance of the data. Most of the data being released now was not gathered with the knowledge that it would be shared in this way. I don’t think we can fix this for historic/existing data sets, but perhaps we can encourage those uncertainties to at least be documented. There’s little incentive to do that right now, because everyone wants their “high value” data sets.

  • Ssuan Wilson

    Good, thoughtful post, but the issues of information (not just data) integrity, adequate context, and forensics still clutch. Without being able to follow a datum’s lifecycle, reverse correction and propagation will not be possible; bad stuff will still spread unchecked. Sure, this opens boatloads of privacy issues, but there should be a way of assuring this.

    An interesting tangent would be determining just who or what would be the arbiter of data/information accuracy integrity. Could be a frisky conversation.

  • Robert Burgholzer

    Wow, this is a really timely piece. In the area of aquatic resource management, “data balkanization” is as pronounced as any other field I have been involved in. I have been working with a group of individuals to borrow some lessons from the Open Source device known as the “code sprint” to help aggregate aquatic resource data – we are calling it a “data sprint”. If you would like to be an observer as we stumble through this process we would welcome it! Below is a brief summary (our wiki supporting this is: http://sifn.bse.vt.edu/sifnwiki/index.php/SIFN_datasprint )

    The Data-Sprint
    The concept of a “code-sprint” is well known throughout the Open Source software community. A code-sprint is a physical and/or virtual meeting to make a concentrated, coordinated effort at enhancing a specific open source software product. Participants gather in an agreed upon time, in a given location, and/or via electronic means to delegate coding tasks, “bug hunts”, and documentation tasks, then to work in a concerted fashion to achieve those tasks in the time allotted. A well organized code-sprint yields great improvements in the software product, and also works to foster community connections among the developers. These events have something for everyone, distributing tasks to match the participants own strengths and areas of interest.

    A “data-sprint” for collecting and aggregating aquatic resource data could work in much the same way. SIFN members could work together to advance the state of the aquatic resource science, by creating a single repository for electronic data that researchers and resource managers can use to help bolster the scientific basis upon which we make our management decisions.

  • J Miller

    Absolutely agreed that relational data needs an equivalent of Subversion (maybe there is one, and I just don’t know about it?).

    The problem with many non-scientific govt datasets is privacy/disclosure. (http://factfinder.census.gov/jsp/saff/SAFFInfo.jsp?_pageId=su5_confidentiality). Census and other socioeconomic data cells frequently need to be “suppressed” in order to protect the privacy of respondents. This means that any “open source” process can only begin after extensive processing and filtering of the data inside the govt agency’s walls. At that point, you’re already several steps removed from the original microdata. There is no solution to this without major privacy violations, which is why a large portion of government data will NEVER be open-sourced even though it is publicly funded.

  • Howard Silverman

    Kaitlin Thaney of Science Commons has a good slide deck on open science and data sharing.

    Also, other presentations from this conference track on ecoinformatics.

  • Chaim Krause

    Any tools should also address the need to have a verifiable data signature to insure that data has not been altered, even by accident or software/hardware malfunction. This should be at the data level and not at the level of the “package” that contains the data.

    I know that all data is not fit for a RDBMS, *but* it would be a good place to *start*. Use SQL:2006/SQL:2008 standards and a reference database engine. Tack on a metadata/data dictionary specification that promotes including all of the relevant information about how/where/when the data was obtained.

    You would then define how the data should be converted to a Unicode string; that string could be hashed to fingerprint each record, and those hashes hashed together for the whole dataset.
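That hash-of-hashes scheme can be sketched in a few lines of Python; the rows and their canonical form below are invented stand-ins for a real specification:

```python
# Sketch: hash each row's canonical Unicode form, then hash the sorted
# row hashes to fingerprint the whole dataset. Any altered value changes
# the dataset fingerprint.

import hashlib

def row_hash(row):
    # Canonical Unicode form: tab-joined field values.
    canonical = "\t".join(str(v) for v in row)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dataset_hash(rows):
    # Hash the hashes (sorted, so record order doesn't matter).
    digests = sorted(row_hash(r) for r in rows)
    return hashlib.sha256("\n".join(digests).encode("utf-8")).hexdigest()

rows = [("NZ123", "Aotea College", 2010), ("NZ456", "Onslow College", 2010)]
fingerprint = dataset_hash(rows)
```

Verifying a dataset is then a matter of recomputing the fingerprint and comparing, which catches accidental corruption as well as silent edits.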

    The main thing is to start somewhere and adjust instead of just talking about it. Of course, if there are existing best practices, I would hope somebody will pipe in and point them out.

  • Gregory Dyke

    My PhD thesis is about how not only data but also the analytic representations created on top of that data can be shared (particularly in the field of computer-supported collaborative work).

    Download it here

    http://www.emse.fr/~dyke

    Something of a long read though :/. In my state of the art, the MULCE project is of particular interest (structuring learning corpora for sharing)

  • Yaron Koren

    It’s somewhat amazing to me that, in 2010, there could be a whole discussion about collaborative open data without the term “semantic wiki” coming up. Clearly we evangelists for semantic wikis have more work to do.

  • Steve Holcombe

    Nat,

    Liked this post so much I submitted it as a news item to the Data Ownership in the Cloud group on LinkedIn (http://tinyurl.com/datacloud) and specifically included the paragraph beginning with: “Finally, we need to change attitudes and social systems. Data is produced as the product of work done, and is rarely conceived of as having a life outside the original work that produced it ….” Right on!

    Also, while I was reading your post I couldn’t help thinking about Silona Bonewald’s vision for ‘open banking’. And I have also blogged about that vision at “Silona Bonewald: Open Banking, metrics and money” (http://pardalis.squarespace.com/blog/2009/10/1/silona-bonewald-open-banking-metrics-and-money.html).

    Again, thanks for the post.

  • Robert Kaye

    Nat:

    >What if we ran an open data project like an open source project? What would this look like?

    It would look like MusicBrainz! :-)

    We are open source, open data and even open finances. We’ve been collaboratively working on a huge data set for years and we’re even licensing for commercial use. And we’re in the black!

    Fortunately we had it easy because we started with no data. That sounds rough, but if you don’t have to liberate data before you start, it makes it all that much easier!

  • John S. Erickson, Ph.D.

    Thanks Nat for this thoughtful and timely post!

    Although it doesn’t completely answer your questions or cover all your challenges, the Memento Project does provide a way to easily and unambiguously reference previous dated versions of resources, including linked data, on the web. Combined with a presentation overlay that helps contributors make sense of their collective efforts, this might be close to what you’ve described!

  • Jon

    I am working for a large health board in the same country you are posting from, Nat, and I would have to agree that the biggest limitation in our reporting and analysis is the accuracy of the dataset. My job is to manage the dataset for our services, to improve the outcomes for our staff and patients.

    The biggest two things I work on now are 1. getting people used to having data available to them, and 2. ensuring that they understand optimal data structures (categorisation, standardisation, and formatting).

    Other things that should have been documented and developed before now, but I have had to put in place are: data quality assurance reporting and documented business practices.

    Within our “business” (the clinicians hate that term), the key issue is the disconnect and general lack of understanding between the people tasked with caring for patients and those tasked with collecting and disseminating information. I am in a very small minority that understands the information collected (I have context) and also how to collect it, and use it.

  • Russell Nelson

    Semantic Web. Nat, you need to get timbl to give you the whole semantic web song and dance. First you start humming, then you start tapping your foot, then you’re flailing your arms and legs and dancing to the beat.

  • Lisa Green

    Excellent post. Reading it allowed me to see open science / open data from a different perspective.

    I strongly agree that “we need to change attitudes and social systems”. But I would say that when it comes to normative changes in the science community, it is much more than just “This means thinking of yourself not just as the person who does the work, but the person who leads a project of interested outsiders and (in some cases) collaborators and who is building something that will last beyond their time”. That is definitely necessary, but it is not sufficient.

    My background is science and I know the academic science community well. There need to be substantial changes in the attitudes and social systems of the scientific community before open data is an accepted approach.

  • Plepe

    I think that OpenStreetMap is a great example, as you state yourself. You have an easy interface to download and edit the data, and a lot of programs around it that cater to specific needs (editors, renderers, debugging tools, little apps, big apps, …). You have a versioned history, though it’s a pain to recover a specific version … hence the planet dumps every now and then.
    Your comparison with git is great and shows the failures of OSM: you do not have branches, and the history handling could be better. And it’s not distributed (why is this important? for the same reason it’s important for software).

    It would be interesting to create a new database system:
    - Distributed, therefore you can just copy a file and give it to your neighbor and merge it with his/her version
    - Simple to use for everybody, with different GUIs , like Access or SPSS.

    Maybe we could build something like this on basis of git?

  • Andrea Di Maio

    Sorry to sound controversial, but I am not sure this is a good idea. If you say that open source projects can give some inspiration for how to develop apps or mashups that use open data, then I’d agree with you. But a main tenet behind open government data is that there is somebody (an agency, a person) who is accountable for that data. You do not want data to be changed by a community: or do you want crime stats to be in a wiki that you and I can modify at will? What you want is to make sure that whoever is in charge of open data uses principles and tools that will make that data accurate and useful. So I’m all for reporting tools and ways to connect data owners with those who use the data, but please make sure you keep the analogy within reasonable boundaries.

  • bob

    Non-trivial > not insignificant/not unsubstantial/far too many/whatever

    Non-trivial means something quite different.

  • Anonymous Coward

    Well written article, one question though?
    What stone age organisations are you working for that do not have the idea of content management systems implemented as part of their production solutions?

    Surely some of this data (structured) is stored in archiveable databases with backups for rerunning historic reports and audits etc

    Nothing new here exactly but worth pointing out

    Unless of course this is just a sneaky way to “open source” data that would otherwise reside under lock and key

  • Kenneth Geisshirt

    When I did scientific research (more than a decade ago) I wondered why no one published the data. Research was conducted, papers written and peer-reviewed and finally published. Scientific papers must include a list of materials and procedures – otherwise it is not possible to reproduce the results. And without reproduction, a scientific finding is worthless.

    Often the data collection is complex, and it is hard if not impossible to reproduce the processes. For example I cannot launch a probe and land it on the Moon. Your idea of open data projects is important for the scientific development.

    Another thing: when scientists publish results, the data is always analyzed using a set of software products. The question is: how do we know if the bugs in the software have an influence on the results? In that respect, open data is important as any scientist can rerun the analysis (using another software product) and see if the result is the same.

    Much science today is done in large teams (CERN, climate monitoring, genome sequencing), and open data ought to be a “natural” thing for these teams.

  • Gerry Creager

    Let me cast one more vote for semantic tools/web/interoperability before I launch into my particular sermon.

    In my experience of late, trying to obtain observations and model data for coastal and ocean systems, I found several things. First, I found that the researchers I’d been tasked to work with were as parochial as those I’d worked with in other disciplines. They had meticulously collected their data, knew it intimately, but kept it captive in the moral equivalent of a personal lab notebook. I’ll explain:
    1. They knew their data, and as I’ve said were intimate with it. Therefore they didn’t need to write down anything to describe it. In other words, they failed to generate the metadata needed to make it usable by others.
    2. They often cited the need to publish as their reason for holding their data close and not sharing it. They did this sometimes for years. (as a digression, I’m aware of a medical researcher who has published some 100 articles… peer reviewed… on the same set of approximately 90 patients by subsetting the patients differently; the uninitiated would think he had a cadre of thousands of patients)
    3. They would claim to be willing to release their data, but you were responsible for its use (assuming they didn’t expect you to come to their lab and physically copy it from its paper form). However, since they recorded little or no metadata, they couldn’t/wouldn’t provide any metadata, so, your use of it required interpolation of metadata, extrapolation of metadata, or creative license to create… metadata.

    In our project, we enforced a few requirements for participation.

    We established a best-practices doctrine that identified a system making use of data similar to ours and in a form that was available and useful, with a data naming convention that provided a convenient starting point for construction of a viable vocabulary and ontology. (CF [climate/forecast] naming convention)

    We agreed upon a set of extensions to CF that would allow us to consistently name/document our data, essentially creating our own extended vocabulary.

    We identified a data format all of us were familiar with, and which had the benefit of, if used as designed, being self-documenting. (NetCDF)

    We agreed upon a method to extend NetCDF to support our particular needs (use of unstructured/finite element grids), and incorporated the data necessary for recreation of the model space as a standard element of the dataset.

    We created and enforced the use of tools that automatically checked data coming into the repository for format and the appearance, at least, of compliance. (Files had to be in verifiable NetCDF form; CF naming structure with our own extensions; etc.)

    We instantiated a simple method of data exchange and automatic archiving. We subscribed to the notion that data were first placed into a secure storage site, immediately registered in a local database, verified for appropriate form, thence registered with a catalog system that was discoverable on the web. This allowed our participants to create a workflow that permitted them to submit data without undue additional effort.

    We allowed all participants unfettered access to the data using SOAP-based services, forms-based web pages, and interactive logins to make their use of the archive and its resident data as easy as possible, thus achieving utilization (if it’s hard to use, people won’t use it, and then they’ll get more lax about contributing to it).

    We allowed select outsiders access to the data and solicited their comments on usability of the data and retrieval systems, and supported enhancements and future development based on their input.

    We met routinely to discuss problems and enhancements.
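The automated compliance check described in this workflow can be sketched simply. The attribute names below follow the CF conventions, but the dict-based variables are a simplified stand-in for real NetCDF structures:

```python
# Ingest gatekeeper sketch: every variable must carry the required
# CF-style metadata attributes before a file is accepted into the
# repository. Variables are plain dicts standing in for NetCDF variables.

REQUIRED_ATTRS = ("standard_name", "units")

def compliance_errors(variables):
    """Return a list of human-readable problems; an empty list means accept."""
    errors = []
    for name, attrs in variables.items():
        for attr in REQUIRED_ATTRS:
            if attr not in attrs:
                errors.append(f"variable '{name}' is missing '{attr}'")
    return errors

incoming = {
    "temp":  {"standard_name": "sea_water_temperature", "units": "K"},
    "depth": {"units": "m"},  # no standard_name: should be rejected
}

problems = compliance_errors(incoming)
```

A real gatekeeper would also validate the file format itself and any extended vocabulary, but even this much catches the "no metadata at all" failure mode described earlier.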

    It required a lot of work, but I believe the results were worth it. At the time we started, we had few extant tools to work with and developed the ones we needed on the fly. It’s probably time to open-source ‘em and see if anyone will find them useful. They’ve never been formally close-sourced, and they’ve been distributed before, but we developed what we needed on the fly, using perl, Python, bash, csh, C, C++ or whatever language was handy and appeared useful at the time.

    Now, if you’ll pardon me, I’ve gotta go flail my arms at the semantic interoperability chicken dance. It really does seem to grab you.

  • DanKasun

    You have some very good thoughts and comments on this, but I want to point out that the tools and processes you recommend are not really specific to open source.

    Commercial software developers use the same tools and processes – collaboration, version management, patching, branching, dependency tracking, issue tracking, reporting, etc.

    What you’re really recommending is applying the best practices of software development and application lifecycle management to open data solutions (the tenets apply equally to open source and Commercial software development).

    Thinking about the full lifecycle of the data is absolutely the right thing, and is the obvious next step for Open Data (whereas everyone is just thinking about what and how to publish right now). Using the proven processes, tenets and tools that exist in the software development world today would provide a good starting point.

    -Dan

  • Ken Williams

    To Anonymous Coward: you’re right that any serious organization uses some form of content management for their core data. But no serious organization would claim that this really “solves the problem” either. The issues of data origin, transformation tracking, etc. are pretty hairy and basically unsolved at this point.

  • David James

    There are many advantages of open source software, and many creative ideas can be hatched in applying it to open data. I suggest a hands-on approach:

    1. find a data set you are interested in and invested in (it will keep you motivated)
    2. work with it programmatically (write an importer, for example)
    3. clean the data. publish any derivative works you make.
    4. share your code, data, and documentation openly
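Steps 2 and 3 might look something like this in miniature; the column names and rows are invented for illustration:

```python
# A toy importer/cleaner: read a raw CSV, normalise obvious
# inconsistencies (stray whitespace, inconsistent casing, string years),
# and yield cleaned records ready to publish as a derivative work.

import csv
import io

def clean_rows(reader):
    for row in reader:
        yield {
            "name": row["name"].strip().title(),
            "year": int(row["year"]),
        }

raw = io.StringIO("name,year\n  aotea college ,2010\nONSLOW COLLEGE,2011\n")
cleaned = list(clean_rows(csv.DictReader(raw)))
```

Publishing both the cleaned output and the cleaning code itself is what makes the derivative work verifiable by others.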

    Where to share your work? Code repositories such as GitHub are a great place for code and data. Also, I would like to recommend an open source project I’m working on at Sunlight Labs: The National Data Catalog (NDC). If you clean any government data, write tools around it, or make derivative data sets, the NDC will be a great place to share your contributions.

  • George Oates

    Open Library is a project run out of the Internet Archive that’s attempting to take this approach. This editable library catalog has been online since 2007, and the vast majority of our dataset has been imported (and massaged) from libraries around the world. This has created a variety of interesting issues, like variations in cataloging practice creating duplicate entries, and so forth.

    The great thing about Open Library is that it is a wiki, which means errors can be corrected, records can be enhanced, and the entire editing history of any record can be viewed historically.

    The whole dataset is available for free download (although we certainly recognize that there’s not much utility in downloading such a massive dataset, unless you’re an uber nerd). We’re looking for ways to increase utility in this area, perhaps by allowing much smaller subsets of data/records to be extracted, say all the records of books about cheese.

  • Anonymous Coward

    @ Mr Williams, so are you saying we are moving towards or we need smarter meta data that is essentially standardised meta data for “communal” use?

    I would suggest, instead of GitHub and such solutions,
    to start thinking of CouchDB style NoSQL implementations for large, versioned data storage with good meta data design.

    Not a “one size fits all” idea as some may be content with smart database solutions for the same problem.

    A followup article detailing what is used and what could be used would be nice :-)

  • Brenda Chawner

    Nat, it’s great to see you thinking about this and getting people to think about what open data needs to be successful. You might find some of the thinking that’s gone on at the Digital Curation Centre in Edinburgh helpful, particularly their lifecycle model.

  • Kevin Curry

    Re: “The official dataset of New Zealand school zones has errors but there’s nobody to report them to, much less a way to submit a fix to a maintainer.”

    Sounds like Open311 for gov data.

  • Anthony Townsend

    I think you’ve just laid out a road map for the next decade of open data and open government. Bravo! How can I help make it happen?

  • Denis

    You mean Wikipedia.org?

  • Josh Burgbacher

    I can not see the government going to open source data! Open source is easily hacked, and with companies such as GOOGLE who got hacked last year, open source will surely be coming to an end. If you read the Cybersecurity Act of 2009 that the House of Representatives passed, it calls for finding more and better ways to control and protect data, and it states the strict intentions of protecting that data. You can find the bill here http://www.govtrack.us/congress/billtext.xpd?bill=s111-773

  • David Clunie

    The National Cancer Institute has been proactive in the effort to share biomedical image and related data used for cancer research. See for example the description of the National Biomedical Imaging Archive (NBIA) at “https://cabig.nci.nih.gov/tools/NCIA/” and the publicly searchable and open archive itself at “https://imaging.nci.nih.gov/ncia/login.jsf”.

    David

  • Christopher

    I think you’ve invented the librarian :-)

    The point that open data can’t just be ‘tossed over the wall,’ and that it needs to be managed to be useful is a very good one.

    But the piece does make me think of the old ‘to a man with a hammer, everything looks like a nail’ cliché. I have no doubt that the lessons of software development can be productively applied to the world of open data, but there’s an entire profession that already works with information in this way: librarians (and archivists of course).

    If I had to make one suggestion that I thought would make a big difference to the success of open data projects, it’d be to make sure librarians are involved at an early stage. Coders and developers may possess expertise in using (and building!) tools to effectively manipulate data, but librarians have a lot of experience ordering and enabling fine-grained retrieval of information–and they’re often rabidly enthusiastic about open data.

  • Pito Salas

    This might be of interest to folks on this thread:

    http://www.scribd.com/full/14136777?access_key=key-78r8bgninjtjfysd6qy

    It’s one of three papers I wrote about a year ago about a standard protocol or format for data discovery and schema discovery. I haven’t worked on it since then, but invite folks to use it or follow up on it if it looks like an interesting direction.

  • Jakob

    Christopher: I wish you were right. Many librarians that I know are still afraid of, or just not interested in, the Web and data. Tim Spalding, the founder of another huge collaboratively edited data set (LibraryThing), is right when he tries to push libraries to get onto the Web. Of course there are also other librarians (for instance, a German library just released 1.3 million records as open data), but they are not the majority.

  • Ian Graham

    A very interesting post. I particularly appreciate the discussion around the need for culture change around data, and the evolving role of librarianship in curating such data collections.

    I also haven’t seen mention of the strong overlap with the open notebook science movement, which aims “…to make the entire primary record of a research project publicly available online as it is recorded.” There is a strong relationship between this vision and the goals of open data.

    You also talk about the quality of the data, and the ability of an open community to improve that quality. That is I suspect true only when the community agrees on a definition for ‘data accuracy.’ That’s easy for things like street maps / address information, but not so easy for controversial data sets such as those for climate data, or for culturally malleable concepts (like biographical data for controversial politicians).

    Indeed, wikipedia had to evolve editorial practices to manage around such challenges — open data management would have to develop similar processes.

    That becomes very important if society has to count on data being accurate, as you wouldn’t want to make major decisions based on questionable data.

    So I think a good first step would be to pick a useful and publicly relevant data set and demonstrate that open approaches improve the data quality.

  • Mark Leggott

    I would suggest an approach that incorporates a system like Fedora (http://fedora-commons.org/), which provides support for any data type and has an extremely flexible object model as well as support for the semantic web with their triplestore. The ICPSR work (http://www.fedora-commons.org/confluence/display/DEV/ICPSR+Content+Models+for+Social+Science+data) is an excellent example of what you can do with Fedora and a lot of complex data.

    We have combined Drupal with Fedora (http://islandora.ca/) to create a system for researchers to steward their research data, regardless of what it is. It also provides the collaboration tools that are so critical to creating, analyzing and sharing the data. It puts control for when and what to share in the hands of the creator, which is key to getting researchers to consider this approach. Add in a dash of what libraries provide in terms of information management and I think you have a powerful combo.

  • Sandro Santilli

    One attempt at a collaborative approach to data is this one: http://www.gadm.org/

    Take a look.

  • Karl E

    Nice post, but “versioning” is not a useful metaphor for keeping track of changes in the data. Most interesting datasets change frequently, and creating a new “version” for each change becomes very cumbersome. Instead, the data itself should carry enough information to tell you, for example, which experiment it belonged to.
    The reason that software is bundled into “versions” is that it is intended to function as a single entity. Data, however, is more useful the less you have to deal with it as a single unit, so you can instead work with it any way you want.

  • Andrew Sunderland

    Good article.

    At the end of the day technology will help solve these issues, but I believe it’s really all a people/process problem.

    People working on data projects need to approach their work with discipline and rigor.

    As you said, ‘open sourcing’ or just exposing short-cuts could be a great first step.