Steep climb for National Cancer Institute toward open source collaboration

Although a lot of government agencies produce open source software, hardly any develop relationships with a community of outside programmers, testers, and other contributors. I recently spoke to John Speakman of the National Cancer Institute to learn about their crowdsourcing initiative and the barriers they’ve encountered.

First let’s orient ourselves a bit, and forgive me for dumping out a lot of abbreviations and organizational affiliations here. The NCI is part of the National Institutes of Health. Speakman is the Chief Program Officer for NCI’s Center for Biomedical Informatics and Information Technology. Their major open source software initiative is the Cancer Biomedical Informatics Grid (caBIG), which supports tools for transferring and manipulating cancer research data. For example, it provides access to data classifying the carcinogenic aspects of genes (The Cancer Genome Atlas) and resources to help researchers ask questions of and visualize this data (the Cancer Molecular Analysis Portal).

Plenty of outside researchers use caBIG software, but it’s a one-way street, somewhat in the way the Department of Veterans Affairs used to release its VistA software. NCI sees the advantages of the give-and-take that the CONNECT project has achieved through assiduous cultivation of interested outside contributors, and wants to wean its outside users away from a dependent relationship that has been all take and no give. Even the VA decided last year that a more collaborative arrangement for VistA would benefit it, and put the software under the guidance of an independent non-profit, the Open Source Electronic Health Record Agent (OSEHRA).

Another model is Forge.mil, which the Department of Defense set up with the help of CollabNet, the company that launched the Subversion revision control tool. Forge.mil represents a collaboration between the DoD and private contractors, encouraging them to create shared libraries that are meant to increase each contractor’s productivity, but it is not open source.

The OSEHRA model (creating an independent, non-government custodian) seems a robust solution, although it takes a lot of effort and risks failure if the organization can’t create a community around the project. (Communities don’t just spring into being at the snap of a bureaucrat’s fingers, as many corporations have found to their regret.) In the case of CONNECT, the independent Alembic Foundation stepped in to fill the gap after a lawsuit stalled CONNECT’s development within the government. According to Alembic co-founder David Riley, with the contract issues resolved, CONNECT’s original sponsor, the Office of the National Coordinator, is spinning off CONNECT to a private sector, open source entity, and work is underway to merge the two baselines.

Whether an agency manages its own project or spins off management, it has to invest a lot of work to turn an internal project into one that appeals to outside developers. Many private corporations as well as public entities have discovered this burden. Tasks include:

  • Setting up public repositories for code and data.

  • Creating a clean software package with good version control that makes downloading and uploading simple.

  • Possibly adding an API to encourage third-party plugins, an effort that may require a good deal of refactoring and a definition of clear interfaces.

  • Substantially adding to the documentation.

  • General purging of internal code and data (sometimes even passwords!) that get in the way of general use.
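
To make the last task concrete, here is a minimal sketch of the kind of sweep a team might run before pushing code to a public repository. It is only an illustration, not anything the NCI has described: the pattern, the function names (find_suspect_lines, scan_tree), and the list of file suffixes are all assumptions, and a real cleanup would also have to check version-control history and data files.

```python
import re
from pathlib import Path

# Hypothetical pattern: a credential-like name followed by an assignment.
# It is deliberately crude and will miss many cases (and flag some harmless ones).
SUSPECT = re.compile(
    r"(password|passwd|secret|api[_-]?key|token)\b\s*[:=]\s*\S+",
    re.IGNORECASE,
)


def find_suspect_lines(text):
    """Return (line_number, line) pairs that look like hard-coded credentials."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SUSPECT.search(line):
            hits.append((number, line.strip()))
    return hits


def scan_tree(root, suffixes=(".py", ".java", ".properties", ".cfg")):
    """Yield (path, line_number, line) for suspect lines in files under root."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            # Ignore undecodable bytes so one odd file does not stop the sweep.
            text = path.read_text(errors="ignore")
            for number, line in find_suspect_lines(text):
                yield path, number, line
```

Calling scan_tree on a checkout and reviewing each hit by hand is the intended use; the script only points at lines worth a second look and does not decide what is safe to publish.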

Companies and institutions have also learned that “build it and they will come” doesn’t usually work. An open source or open data initiative must be promoted vigorously, usually with challenges and competitions such as those the Department of Health and Human Services offers in its annual Health Data Initiative forums (a.k.a. datapaloozas).

With these considerations in mind, the NCI decided in the summer of 2011 to start looking for guidance and potential collaborators. Here, laws designed long ago to combat cronyism put up barriers. The NCI was not allowed to contact anyone it wanted out of the blue. Instead, it had to issue a Request for Information and talk to the people who responded. Although the RFI went online, it obviously wasn’t widely seen. After all, do you regularly look for RFIs and RFPs from government agencies? If so, I can safely guess that you’re paid by a large company or lobbying agency to follow a particular area of interest.

RFIs and RFPs are released as a gesture toward transparency, but in reality they just make it easier for the usual crowd of established contractors and lobbyists to build on the relationships they already have with agencies. True to form, the NCI received only a limited set of responses and was frustrated in its attempts to talk to new actors with the expertise it needed for its open source efforts.

And because the RFI allowed only a limited time window for responses, there is no point in responding to it now.

Still, Speakman and his colleagues are educating themselves and meeting with stakeholders. Cancer research is a hot topic drawing zealous attention from many academic and commercial entities, and they’re hungry for data. Already, the NCI is encouraged by the initial positive response from the cancer informatics community, many of whom are eager to see the caBIG software deposited in an open repository like GitHub right away. Luckily, HHS has already negotiated terms of service with GitHub and SourceForge, removing at least one important barrier to entry. The NCI is packaging its first tool (a laboratory information management system called caLIMS) for deposit into a public repository. So I’m hoping the NCI is too caBIG to fail.
