Bulk Data Downloads: A Breakthrough in Government Transparency

As those of you who follow my tweets know, I spent last week out in Washington, D.C. meeting
with various folks, attending Transparency Camp, and giving a couple of talks. One of the more
interesting meetings was with staffers from Congressman Mike Honda’s office. He represents the
area around San Jose, including much of Silicon Valley.

Honda sits on a very interesting subcommittee of Appropriations, the one responsible for the
legislative branch, an enormously powerful place to be since they hold the purse strings for
everything from how much money members get for office supplies to how much money employees get
paid. Honda’s staff told me about an interesting rider the subcommittee was working on which
would require the agencies that serve the U.S. Congress to distribute their data in bulk.

John Wonderlich from the Sunlight Foundation wrote to me this morning to tell me the provision made it
into the Omnibus Appropriations Bill. This is big news. Honda’s staff told me that the
Congressman had been working on this for a year.

Here’s a link to the appropriations bill. (I sure wish they gave us the ability to pull these
things up in HTML, go directly to a bookmarked section, and use change control to see what has
changed, but that’s another post.)

The money quote is this paragraph right here:

*Public Access to Legislative Data* – There is support for enhancing public
access to legislative documents, bill status, summary information, and other
legislative data through more direct methods such as bulk data downloads and
other means of no-charge digital access to legislative databases. The
Library of Congress, Congressional Research Service, and Government Printing
Office and the appropriate entities of the House of Representatives are
directed to prepare a report on the feasibility of providing advanced search
capabilities. This report is to be provided to the Committees on
Appropriations of the House and Senate within 120 days of the release of Legislative
Information System 2.0.

Advanced search is great, and the Legislative Information System 2.0 thing sounds very
good as well, but I was struck by the phrase “bulk data downloads and other means of no-charge digital access to legislative databases” and the specific
reference to agencies. What would it mean if all the bulk data from the Library of
Congress, Congressional Research Service, Government Printing Office, and “the
appropriate entities of the House of Representatives” were made available? I asked
Carl Malamud, who has worked with many of these databases, if this looked like
something real or just another report.

Carl replied:

Wow! This is huge. The language only requires a report, but a report to an Appropriations subcommittee
means a whole bunch, because if they don’t like your report, you don’t get money. (Appropriations
was where the action occurred when we took on the Smithsonian over the Showtime deal. Once
they cut the budget by $28 million, they had their attention.) Here’s what this means:

  • The Library of Congress sells a series of
    expensive bulk data products
    including the Copyright Database, card catalog information in XML, and what are
    known as “authority files” which are lists of names, subjects, and other classification
    headings so that all libraries can call things by consistent names. Even though the
    data is public, it is very expensive today. The Copyright Database, for example, costs $86,625 for
    the retrospective and a one-year feed. (We harvested this in 2007, as you reported, but
    this would be much easier if they simply provided an FTP server and rsync!)
  • The Government Printing Office sells the Official Journals of Government, which
    we’ve been working very hard on
    harvesting and purchasing.
    If we had $100,000, we would have bought
    one of each long ago. This stuff is the official record of the United States.
    Here’s the full list of
    databases, including the Congressional Record, the Compilation of Presidential Documents,
    the United States Code, and much more. [Editorial note: The NY Times blog just did a piece earlier today about Carl’s quest to reinvent the mission of the Government Printing Office for the 21st Century.]
  • The Congressional Research Service is such a no-brainer. With the exception of
    classified information, who can afford the luxury of paying for some of the best
    research in the world and then just bury it! Taxpayer dollars paid for CRS reports
    and they need to be available. (More on this at the Sunlight Foundation.)
  • Other Entities of the House is the most impressive clause in that whole paragraph.
    My reading is that this clause includes bulk access to broadcast quality video from every
    congressional hearing. And if it doesn’t include that, I wish they’d make it clear
    in report language. In this day and age, you can’t say a committee hearing is public
    if you can’t access it on the Internet. Itty-bitty streaming video using some
    proprietary client/format just doesn’t cut it any more. We ran a pilot
    with four House committees to show that this is very doable and makes a huge difference (check out the
    before and after shots on this video of Chad Hurley testifying before Congress).

On video, I want to add one more note. Policy on what gets archived and distributed from a
committee hearing is up to committee chairmen. It’s a very decentralized system. So, if we’re
serious about putting broadcast quality video from congressional hearings on-line, the
Legislative Branch Subcommittee of Appropriations would be a wonderful place to start. Happy
to help if they need a hand!

And, if we’re going to do video, there is one more administrative entity in the House
that we should call out. The House Broadcast Studio has a huge archive of prior hearings.
We asked Speaker Pelosi if we could run FedFlix on that archive and her staff sounded
very supportive.
(FedFlix is our program to help government agencies: they send us
video, we digitize it, send it back to them. No cost to the government, more data
for the public domain!) It would be great if the report from the House Broadcast Studio
dealt with how they’re going to make their archive of several thousand hearings
available as high-resolution, downloadable video.
Again, happy to help if they need a hand.

Bottom line? This is really great if they can pull this off. Congratulations to
Congressman Honda, as well as to the Sunlight Foundation which I know did some heavy
lifting on this issue. (Sunlight has turned into a remarkably effective lobbyist
in favor of transparency. They’re outgunned by K-Street, but they’re definitely
holding their own!)

When Carl Malamud convened a group of 30 open government advocates at O’Reilly’s offices late in 2007, a lot of the discussion focused on this very topic. The group came up with eight guiding principles on the subject of open data. One of the key points was that when government agencies release bulk data, they should do so in the lowest-level format possible. For example, for the Congressional Record and other official journals of government, we want XML plus images, as opposed to just PDF files or other final-form data.
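
To make the point about formats concrete, here is a minimal sketch in Python of what structured data lets a developer do. The XML layout below is entirely hypothetical: the element names, attribute names, speakers, and text are invented for illustration and are not the schema of any real Government Printing Office or Library of Congress product. The only point is that with markup, grouping remarks by speaker takes a few lines, whereas a final-form format like PDF would first have to be scraped back out of its page layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: this layout is invented for illustration and is
# NOT the schema of any real Congressional Record feed.
SAMPLE = """\
<record chamber="House">
  <speech speaker="Member A">
    <para>I rise in support of the amendment.</para>
  </speech>
  <speech speaker="Member B">
    <para>I thank the gentleman.</para>
    <para>I yield back.</para>
  </speech>
</record>
"""


def speeches_by_speaker(xml_text):
    """Return {speaker: [paragraph text, ...]} for the hypothetical layout."""
    root = ET.fromstring(xml_text)
    result = {}
    for speech in root.iter("speech"):
        # Fall back to "unknown" if the (assumed) speaker attribute is missing.
        speaker = speech.get("speaker", "unknown")
        paras = [(p.text or "").strip() for p in speech.iter("para")]
        result.setdefault(speaker, []).extend(paras)
    return result


if __name__ == "__main__":
    for speaker, paras in speeches_by_speaker(SAMPLE).items():
        print(f"{speaker}: {len(paras)} paragraph(s)")
```

Nothing here depends on the particular tags chosen. Any consistent, documented markup would support the same kind of query, which is why the lowest-level format matters more than the polish of the final rendering.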

I’d love your thoughts on what government data should be made available, what formats it should be available in, and what you’d do with it if you had access to it. When I spoke with Congressman Honda’s staff, they made clear that they’d love Silicon Valley’s best ideas for other technological reforms that they can include in future legislation. When you’ve got a Congressman who’s paying attention, that’s a great opportunity! I’m fairly sure that the Congressman will be checking the comments on this post, so it’s your chance to let him know what you think.

P.S. Rob Pierson, Congressman Honda’s Online Communications Director, actually gave a Q&A session at Transparency Camp in which he asked for ideas about how to redesign the Congressman’s web site. He got lots of suggestions, including ways to incorporate Twitter and Facebook feedback, but I’m sure that there are many more ideas. So in addition to responding with ideas about the bulk data provision in the legislation, this is a great chance to give the Congressman feedback on how he can do a better job listening to you.

Update: While it isn’t clear whether CRS reports would be covered under this provision, Senator Lieberman yesterday wrote a letter to the Senate Rules Committee Chairman asking for greater public access to CRS reports. With the new Democratic majority in Congress and the White House, transparency measures are sprouting up everywhere.