Carl Malamud has this funny idea that public domain information ought to be… well, public. He has a history of creating public access databases on the net when the provider of the data has failed to do so or has licensed its data only to a private company that provides it only for pay. His technique is to build a high-profile demonstration project with the intent of getting the actual holder of the public domain information (usually a government agency) to take over the job.
Carl’s done this in the past with the SEC’s Edgar database, with the Smithsonian, and with Congressional hearings. But now, he’s set his sights on the crown jewels of public data available for profit: the body of Federal case law that is the foundation of multi-billion dollar businesses such as Westlaw.
On a site that just went live tonight, Carl has begun publishing the full text of legal opinions, starting back in 1880, and has outlined a process that will eventually lead to a full database of US case law. Carl writes:
1. The short-term goal is the creation of an unencumbered full-text repository of the Federal Reporter, the Federal Supplement, and the Federal Appendix.
2. The medium-term goal is the creation of an unencumbered full-text repository of all state and federal cases and codes.
This is clearly public data, but as Carl wrote in a letter to West Publishing that accompanies the first data release on his site, asking for clarification about what information West considers proprietary versus public domain:
In looking through the court decisions of a decade ago where West and your commercial competitors fought over the right to re-publish case law, it seems fairly clear that a large part of the publication stream is tightly interwoven into the very substance of the operation of the courts, with West serving as the either contractual or de-facto sole vendor reporting on behalf of the court.
Carl’s letter goes on to ask West to release the full text of the Federal Reporter, Federal Supplement, and Federal Appendix. He says:
You have already received rich rewards for the initial publication of these documents, and releasing this data back into the public domain would significantly grow your market and thus be an investment in your future.
Elsewhere in the letter, he writes:
We wish to make this information available to a population that today does not have access to the
decisions of our federal and state courts because they are not commercial subscribers to one of the
handful of services such as your award-winning Westlaw tools. Codes and cases are the very operating
system of our nation of laws, and this system only works if we can all openly read the primary sources.
It is crucial that the public domain data be available for anybody to build upon.
Now, it could be that West will eventually go along. Their real proprietary data isn’t the text of the case law itself so much as their key number system and the accompanying summaries, or “headnotes,” as well as their value-added tools for searching and managing the voluminous amount of data. But Carl’s project is intended to point out that if they don’t, he’ll be able to make the data public anyway.
(Note: in the last decade, the Federal courts have begun publishing the data on current opinions, and law professor Tim Wu’s AltLaw site provides a full text search engine for those recent cases. But the historical record is much more difficult.)
Carl’s starting point is the “ultrafiche” version of the Federal Reporter, which West published before the advent of online database versions. An ultrafiche presents up to 1000 pages on a 4 by 6 inch transparency, like the one shown below:
Carl has begun enlarging, processing, and publishing the images, and has started OCRing them to extract the text. After 87x enlargement, the test images are quite readable.
In private email, Carl wrote:
The SEC database was fairly straightforward, taking a couple of
years of hard work. But, getting patents online took 5 years of
drawing lines in the sand and sending shots across the bow. Our
line in the sand here is all state and federal cases and codes, and
I guess our shot across the bow is publishing a 3.6 gbyte tiff file
and announcing our intention to systematically walk through the
5 million or so pages of federal case law.
That’s a big challenge, but with computing power and storage getting ever cheaper, and with the dedication of volunteers like Carl, it does indeed seem like a feasible project. (After all, when Carl pressured the SEC to put its Edgar database online in the early ’90s, they said it would take years and millions of dollars. Carl did it in six weeks, and operated the database for two years before persuading the SEC to take it over.)
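To get a feel for the scale, here is a rough back-of-the-envelope sketch using only the two figures quoted above: “up to 1000 pages” per ultrafiche and “5 million or so pages” of federal case law. Both numbers are approximate, the function name is my own, and real sheets are presumably not all full, so treat the result as a lower bound rather than a count of what Carl actually has to scan.

```python
# Rough lower bound on the number of ultrafiche sheets to process.
# Both constants are approximations taken from the text above.
PAGES_TOTAL = 5_000_000   # "5 million or so pages" (from Carl's email)
PAGES_PER_FICHE = 1000    # "up to 1000 pages" per 4 by 6 inch sheet


def min_fiche_count(total_pages: int, pages_per_fiche: int) -> int:
    """Fewest sheets needed if every sheet were completely full (ceiling division)."""
    return (total_pages + pages_per_fiche - 1) // pages_per_fiche


if __name__ == "__main__":
    print(min_fiche_count(PAGES_TOTAL, PAGES_PER_FICHE))
```

Under those assumptions the job comes to several thousand sheets at a minimum, each of which has to be enlarged, processed, and OCRed.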
P.S. John Markoff has covered Carl’s escapades for at least as long as I have, so I wasn’t surprised to see him at Carl’s offices (now on the O’Reilly campus in Sebastopol) last week. His coverage of this story for the New York Times is <a href="http://www.nytimes.com/2007/08/20/technology/20westlaw.html?_r=1&oref=slogin">here</a>.