For the past 5 years, I’ve haunted the halls of the U.S. Congress with a geeky ask:
broadcast-quality video from all congressional hearings should be posted on the Internet.
I gave a tech talk at Google, drew up
business plans (pdf) to start a new nonprofit,
enlisted the help (pdf)
of the Public Printer, and harassed my friends in the
media and my friends working for the former Speaker (pdf).
My motivation has been a deeply felt belief that one should not have to live inside the Washington, D.C.
beltway in order to observe the proceedings of the U.S. Congress. No matter what our political beliefs,
no matter how much we disagree on the issues, we must all agree that the business of the Congress is the
business of the People. Today, that means that business must be conducted so that it is visible on the Internet.
Today, we are announcing a new site, House.Resource.Org.
The site currently holds over 500 hearings of the House Committee on Oversight and Government Reform,
which we obtained from C-SPAN. Under an agreement reached with Chairman Darrell Issa
and Speaker of the House John A. Boehner, we are now in receipt of several
hundred more high-resolution files from 2009 and 2010 hearings that will be loaded on the site. In addition,
the Committee has agreed to furnish us with high-resolution files from all hearings in 2011, which we
will be posting on a weekly basis. Note that this is not a real-time service, we are posting big files
A letter received today from Chairman Darrell Issa and Speaker of the House John A. Boehner
states that it is their hope “that this project is only the beginning of an effort to eventually
bring all congressional committee video online.”
On a technical note, house.resource.org serves the files over HTTP, rsync, and FTP. We’ve also put
in place many of the official GPO transcripts as signed PDF and as raw text. If you’d like to view
the files, you’ll be able to do so on YouTube,
the Internet Archive,
and on C-SPAN. We also expect other organizations
to make use of this material. The C-SPAN video is licensed for non-commercial attribution use and the
material from the Congress is in the public domain.
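For anyone who wants to pull files over HTTP, here is a minimal sketch of a streaming download in Python. The function name `fetch` and the chunk size are my own choices for illustration, not part of the service, and the directory layout on the server is not described here, so any concrete URL you pass in is something you will need to look up on the site.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream `url` to `dest` in chunks so a large video file never sits in memory.

    `fetch` is a hypothetical helper, not an API offered by the site.
    """
    # Create the target directory if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy the response body to disk one chunk at a time.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest
```

Copying in fixed-size chunks matters because these are big, high-resolution files. For mirroring many files at once, rsync is likely the better fit than a loop over HTTP requests.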
We have two hacks that I think are fairly significant. First, copies of this data are being furnished
on disk drives to the Office of House Preservation and the National Archives and Records Administration,
with officials of both organizations happily accepting this addition to our nation’s permanent record. It
is our hope that archivists, librarians, and historians will make good use of this material.
The second hack is something we are doing that leverages some amazing work being done by the
YouTube engineering team. In many cases, we’ve been
able to take the video of a hearing and mash it up with the official GPO transcript. Look at this
embedded video of a hearing about the AIG Collapse and Federal Rescue:
For this video, we took the text version of the official transcript and hacked it up by hand to
produce a version of the transcript without timecodes.
We cut out any embedded prepared statements, turned the names of the speakers from raw text into a more
cc-friendly [Speaker Name], typed in any commentary at the beginning from C-SPAN, and then the
whole thing was fed into YouTube’s magic transcription engine. What popped back out was
timecode-aligned closed captions based on the official transcript
suitable for use in your own accessibility applications, to use as a search tool into the video, or
as the basis for translation into other languages.
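The speaker-name step described above is the part most amenable to automation. Here is a rough sketch in Python, not the process we actually used (that was done by hand). It assumes a speaker turn opens a line with a title and surname followed by a period, as in "Mr. Towns. The committee will come to order."; the list of titles and the function name `bracket_speakers` are illustrative assumptions, and real transcripts will have cases this pattern misses.

```python
import re

# Assumed shape of a speaker turn: optional indent, a title, a surname,
# then a period and the start of the spoken text.
SPEAKER = re.compile(
    r"^\s*((?:Mr|Mrs|Ms|Dr|Chairman|Chairwoman|Senator)\.? [A-Z][A-Za-z'\-]+)\.\s+"
)


def bracket_speakers(transcript: str) -> str:
    """Rewrite 'Mr. Towns. text' as '[Mr. Towns] text', one line at a time."""
    out = []
    for line in transcript.splitlines():
        # Lines that do not open with a speaker name pass through unchanged.
        out.append(SPEAKER.sub(r"[\1] ", line))
    return "\n".join(out)


if __name__ == "__main__":
    sample = (
        "    Mr. Towns. The committee will come to order.\n"
        "We will now hear from the witnesses."
    )
    print(bracket_speakers(sample))
```

Cutting out prepared statements and adding the opening commentary still need a human eye, since where those passages start and end is not something a one-line pattern can reliably find.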
There are a few limits on this magical service as it is in early beta. We don’t have official
GPO transcripts yet for all the hearings, and the timecode-alignment engine is still limited to
videos that are 90 minutes or less. But there is great hope that this technology can provide accessible
video not only of the workings of Congress but of the workings of any deliberative body that uses
official transcripts, such as courts, city councils, and state legislatures. There is an added bonus,
which is that having such a large trove of verified transcripts that we can align with video means
that this text can be used to train the machine-transcription engine to be more accurate by comparing
what the software recognized with what was actually said.
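One common way to compare what the software recognized with what was actually said is word error rate: the number of word insertions, deletions, and substitutions needed to turn the machine output into the verified transcript, divided by the length of the verified transcript. The sketch below is a generic illustration of that measure in Python; I am not claiming it is what any particular transcription engine uses, and the function name `word_error_rate` is my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two texts, divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    said = "the committee will come to order"
    heard = "the committee will come to border"
    print(word_error_rate(said, heard))
```

A score of 0 means the machine output matches the verified transcript word for word, and higher scores mean more errors. Punctuation and speaker labels would need to be stripped first for the number to be meaningful.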
If you would like to help with the process of prepping transcripts, please contact me. (Hint: my
email address is on the Public.Resource.Org about page, or contact me on
Twitter where I trade as @carlmalamud.) In terms of
timing, we should have the backfile fully loaded by the end of January. We’re expecting our first
shipment of current hearings by mid-month, and this service should be fully operational by end of
February. Right now, we’re just doing the House Oversight Committee, but we have the capacity
to do one or two more committees, so the service may expand quickly.