Carl Malamud

House.Resource.Org

Hundreds of high-res videos from House Oversight Committee hearings will be available on a new website.

by @CarlMalamud | 5 January 2011 | Comments: 2

For the past 5 years, I've haunted the halls of the U.S. Congress with a geeky ask: broadcast-quality video from all congressional hearings should be posted on the Internet. I gave a tech talk at Google, drew up business plans (pdf) to start a new nonprofit, enlisted the help (pdf) of the Public Printer, and harassed my friends in the mainstream media and my friends working for the former Speaker (pdf).

My motivation has been a deeply felt belief that one should not have to live inside the Washington, D.C. beltway in order to observe the proceedings of the U.S. Congress. No matter what our political beliefs, no matter how much we disagree on the issues, we must all agree that the business of the Congress is the business of the People. Today, that means that business must be conducted so that it is visible on the Internet.

Today, we are announcing a new site, House.Resource.Org. The site currently contains over 500 hearings of the House Committee on Oversight and Government Reform that we obtained from C-SPAN. Under an agreement reached with Chairman Darrell Issa and Speaker of the House John A. Boehner, we are now in receipt of several hundred more high-resolution files from 2009 and 2010 hearings that will be loaded on the site. In addition, the Committee has agreed to furnish us with high-resolution files from all hearings in 2011, which we will be posting on a weekly basis. Note that this is not a real-time service: we are posting big files after the fact.

A letter received today from Chairman Darrell Issa and Speaker of the House John A. Boehner states that it is their hope “that this project is only the beginning of an effort to eventually bring all congressional committee video online.”

On a technical note, house.resource.org serves the files over HTTP, rsync, and FTP. We've also put in place many of the official GPO transcripts as signed PDF and as raw text. If you'd like to view the files, you'll be able to do so on YouTube, the Internet Archive, and on C-SPAN. We also expect other organizations to make use of this material. The C-SPAN video is licensed for non-commercial attribution use and the material from the Congress is in the public domain.

We have two hacks that I think are fairly significant. First, copies of this data are being furnished on disk drives to the Office of House Preservation and the National Archives and Records Administration, and officials of both organizations have happily accepted this addition to our nation's permanent record. It is our hope that archivists, librarians, and historians will make good use of this material.

The second hack is something we are doing that leverages some amazing work being done by the YouTube engineering team. In many cases, we've been able to take the video of a hearing and mash it up with the official GPO transcript. Look at this embedded video of a hearing about the AIG Collapse and Federal Rescue:

To make this video, we took the text version of the official transcript and hacked it up by hand to produce a version of the transcript without timecodes. We cut out any embedded prepared statements, turned the names of the speakers from raw text into a more cc-friendly [Speaker Name], typed in any commentary at the beginning from C-SPAN, and then fed the whole thing into YouTube's magic transcription engine. What popped back out was timecode-aligned closed captions based on the official transcript, suitable for use in your own accessibility applications, as a search tool into the video, or as the basis for translation into other languages.
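For anyone who wants to automate part of that hand cleanup, here is a minimal Python sketch of the speaker-tagging step only. It is not the process we actually used (that was done by hand), and it assumes that a speaker turn in the raw text begins a line with a title, a surname, and a period, as in "Mr. Towns. The committee will come to order." The pattern, the title list, and the function name tag_speakers are all illustrative assumptions; real transcripts will need adjustments.

```python
import re

# Assumed speaker-turn layout: optional indent, a title, a surname, a period,
# then the spoken text, e.g. "    Mr. Towns. The committee will come to order."
# The title list is an illustrative guess, not a GPO specification.
SPEAKER_TURN = re.compile(
    r"^\s*(?P<speaker>(?:Mr|Ms|Mrs|Dr|Chairman|Chairwoman)\.? [A-Z][A-Za-z'\-]+)\.\s+"
)


def tag_speakers(lines):
    """Rewrite 'Mr. Towns. text' as '[Mr. Towns] text'; strip other lines."""
    tagged = []
    for line in lines:
        match = SPEAKER_TURN.match(line)
        if match:
            # Keep only the spoken text that follows the speaker label.
            rest = line[match.end():].strip()
            tagged.append(f"[{match.group('speaker')}] {rest}")
        else:
            tagged.append(line.strip())
    return tagged


if __name__ == "__main__":
    sample = [
        "    Mr. Towns. The committee will come to order.",
        "Thank you all for coming.",
    ]
    for out in tag_speakers(sample):
        print(out)
```

Cutting out prepared statements and adding the C-SPAN commentary still need a human eye, so a script like this only covers the mechanical part of the job.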

There are a few limits on this magical service, as it is in early beta. We don't have official GPO transcripts yet for all the hearings, and the timecode-alignment engine is still limited to videos that are 90 minutes or less. But there is great hope that this technology can provide accessible video not only of the workings of Congress but of the workings of any deliberative body that uses official transcripts, such as courts, city councils, and state legislatures. There is an added bonus: having such a large trove of verified transcripts that we can align with video means that this text can be used to train the machine-transcription engine to be more accurate, by comparing what the software recognized with what was actually said.
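One common way to put a number on "what the software recognized" versus "what was actually said" is word error rate: the count of word-level substitutions, insertions, and deletions needed to turn the recognized text into the verified text, divided by the length of the verified text. The Python sketch below is a generic illustration of that metric. I am not claiming it is what the YouTube engine uses internally, and the function name word_error_rate is my own.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two texts, divided by reference length.

    reference  -- the verified transcript text (what was actually said)
    hypothesis -- the machine-recognized text
    Words are compared case-insensitively; punctuation is NOT stripped.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    said = "the committee will come to order"
    heard = "the committee well come order"
    print(word_error_rate(said, heard))
```

A rate of 0 means the recognized text matches the verified transcript word for word. Because insertions count as errors, the rate can exceed 1 when the software produces far more words than were spoken.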

If you would like to help with the process of prepping transcripts, please contact me. (Hint: my email address is on the Public.Resource.Org about page, or contact me on Twitter where I trade as @carlmalamud.) In terms of timing, we should have the backfile fully loaded by the end of January. We're expecting our first shipment of current hearings by mid-month, and this service should be fully operational by the end of February. Right now, we're just doing the House Oversight Committee, but we have the capacity to do one or two more committees, so the service may expand quickly.

Comments: 2

Scott Kraz [6 January 2011 10:13 AM]

Is it necessary or even wise to go with high resolution video? It seems like lower res video would be adequate for most hearings, and easily decrease the amount of manual labor by a factor of 2 or 4.

It almost seems like we're being snowed over with too much data here. Granted electronic transcription simplifies this problem.