Sat, Nov 17, 2007

Tim O'Reilly

Bulk Access to Government Printing Office Data

Carl Malamud of public.resource.org just wrote to let me know that he's begun harvesting all the data currently being provided by the Government Printing Office, including the Congressional Record, various presidential papers (up until 2004, when they stopped being made public), the Federal Register, and other government documents. These have been available one document at a time from the GPO; Carl is now making them available for bulk download, which is useful for anyone who wants to do text analysis. Carl wrote:

With help from the Institute of WGET and the fine folks at Ibiblio, we now have bulk interfaces to the Government Printing Office up and running. Here's the Harvest Report.

As you may know, people have been complaining for years that these databases haven't been available in bulk for free download. Most people thought there were two solutions available to this problem:

1. Ask for GPO's help so that a third party, e.g., a library, can make the data available in bulk.

2. Ask GPO to provide the service themselves.

We called up the Government Printing Office and presented them with a third alternative: we were planning on harvesting their data straight through the user interface. This cost zero resources for the GPO, so their answer was "knock yourself out." They even gave us tech contacts in case we had problems.

Turns out sometimes all you have to do is ask ...




Fran Toolan   [11.17.07 11:28 AM]

The GPO became a client of ours about 2 years ago to collect and distribute bibliographic metadata on their titles via ONIX. Since that time they have created excellent bibliographic information for about 4,000 titles.

While this is probably a drop in the bucket of what will come through the interface you mention above, if anyone wants that data they should contact eloquence@qsolution.com

Thomas Lord   [11.17.07 07:24 PM]

Thank You! Great job! I hope that you keep it up. I only worry that it might come across as arrogant to say these things that barely need saying.

-t

Cass Hartnett   [11.20.07 09:37 AM]

One of the real values of GPO's data is that files are, in fact, open for public download and harvesting -- and I applaud anyone who devises improved access to this freely-available stuff, via bulk harvesting or other means.

The writer's parenthetical comment about "[various presidential papers] being available up until 2004, when they stopped being made public" is extremely misleading. All of the files available at http://www.gpoaccess.gov/executive.html are up to date. The bound set known as Public Papers of the President takes time to prepare and publish, and has had a traditional "publication lag" of years ... sometimes more than the current 3-year lag (this has been the practice for decades). So don't worry, President George W. Bush's Public Papers will continue to be produced and disseminated on schedule. I would be very surprised indeed to be corrected on this point.


Carl Malamud   [11.20.07 10:14 AM]

Cass Hartnett is correct on the public papers. Our mirror merely moved what was available from GPO from their WAISgate interface and threw it over the wall where it can be accessed with other protocols, such as rsync, ftp, and bittorrent.

Why bother? WAISgate lets you do keyword searches and build a URL to a particular page. But, the URLs are clunky, you can't get all the pages in bulk without a whole lot of trouble, and you're dependent on the somewhat ancient GPO infrastructure. In Web 2.0 terms, it is as if one could access the Wikipedia only from the web site run by the Foundation instead of incorporating that data into new things like Freebase.

What we did was move that stuff over the wall so people can begin incorporating the data into their services. For example, Google could now spider the data, whereas previously they were prohibited by the GPO robots.txt file.
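To make the robots.txt point concrete, here is a minimal sketch of how a well-behaved crawler decides whether it may fetch a page. It uses Python's standard `urllib.robotparser`. The two robots.txt bodies, the `ExampleBot` agent name, the `may_fetch` helper, and the example.gov URL are all hypothetical illustrations; they are not the GPO's actual file or anyone's real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that tells every crawler to stay out.
# This is an illustration only, NOT the GPO's actual file.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt for a mirror that welcomes crawlers
# (an empty Disallow rule means nothing is off limits).
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def may_fetch(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "http://example.gov/some/page.html"  # hypothetical URL
    print("blocking site:", may_fetch(BLOCK_ALL, page))
    print("open mirror:  ", may_fetch(ALLOW_ALL, page))
```

A crawler that honors the first file skips the site entirely, while the same documents served from a host with the second file can be indexed.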

If there's a technical contribution in this and in a previous copyright episode, I'd suggest it is establishing the principle that just because government posts a robots.txt file doesn't mean you can't crawl their data. And, perhaps we can even stretch this towards a broader principle: government can sell whatever data they want at whatever price they want, but they also have to give it away, ensuring that there is always a public outlet for public data. (FWIW, that's not just my crazy idea, it is actually enshrined in Circular A-130.)

