Carl Malamud
Carl Malamud is the founder of Public.Resource.Org, a nonprofit that has been instrumental in placing government information on the Internet. Prior to that he was the Chief Technology Officer at the Center for American Progress and was the founder of the Internet Multicasting Service, where he ran the first radio station on the Internet.
Fri, Nov 20, 2009
Robots.Txt and the .Gov TLD
by Carl Malamud | @CarlMalamud | comments: 9
I'm on the board of CommonCrawl.Org, a nonprofit corporation that is attempting to provide a web crawl for use by all. An interesting report just got sent to us about the use within the .gov top-level domain of robots.txt files, the mechanism defined by the Robots Exclusion Standard.
In examining about 32,000 subdomains in .gov, it turns out that at least 1,188 of these have a robots.txt file with a "global disallow," meaning robots are excluded from indexing this content. Even more curious, on 175 of these sites, while there is a global disallow, there is a specific bypass that allows the Googlebot to index the data. You can look at the raw data on Factual.
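For readers who have not looked inside one of these files, here is a minimal sketch of what a "global disallow" with a Googlebot bypass might look like and how a crawler could detect it. The sample file, the `classify` function, and the `SomeOtherBot` agent name are all made up for illustration; this is not the method used in the report, just one way to check with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is shut out, but Googlebot is let in.
SAMPLE = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def classify(robots_txt: str, probe_agent: str = "Googlebot") -> str:
    """Label a robots.txt body by how it treats the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "SomeOtherBot" is an invented name standing in for any unlisted crawler.
    everyone_blocked = not parser.can_fetch("SomeOtherBot", "/")
    probe_allowed = parser.can_fetch(probe_agent, "/")
    if everyone_blocked and probe_allowed:
        return "global disallow with bypass"
    if everyone_blocked:
        return "global disallow"
    return "open"


if __name__ == "__main__":
    print(classify(SAMPLE))
```

An empty `Disallow:` line means "nothing is off limits," which is why the Googlebot entry above acts as a bypass.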
At Public.Resource.Org, we've always felt that the government should use a robots.txt file only for purposes of security and integrity of the site, not because some webmaster arbitrarily decides they don't want to be indexed. Indeed, on several occasions we have deliberately ignored government-imposed robots.txt files because we felt this was an arbitrary and illegal attempt to keep the public out.
And, needless to say, it doesn't make any sense at all to let in some webcrawlers and not let in others. If this is a reaction to a security/integrity issue, such as limited capacity, the proper thing to do is include a comment in the robots.txt file that explains to the operators of other bots what is going on. For example, it could be perfectly reasonable for a government group faced with limited capacity to ask a robot to limit crawls to a certain number of queries per second and only whitelist crawlers that agree to that condition.
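As a rough illustration of the non-discriminatory approach, the sketch below shows a robots.txt that explains the capacity problem in a comment and asks every robot for the same delay. The file contents, the `/search` path, the `crawl_policy` helper, and the `AnyBot` name are hypothetical. `Crawl-delay` is a widely recognized extension rather than part of the original standard, so not every crawler honors it; Python's `urllib.robotparser` does read it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the comment tells bot operators why the limit exists,
# and the rule applies to every robot equally.
POLITE = """\
# Limited server capacity: please wait 2 seconds between requests.
User-agent: *
Crawl-delay: 2
Disallow: /search
"""


def crawl_policy(robots_txt: str, agent: str):
    """Return (requested delay in seconds, whether the site root may be fetched)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent), parser.can_fetch(agent, "/")


if __name__ == "__main__":
    delay, allowed = crawl_policy(POLITE, "AnyBot")
    print(delay, allowed)
```

A crawler that agrees to the condition would simply sleep for the returned delay between requests.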
Government webmasters should use the robots.txt file sparingly, and should do so in a non-discriminatory fashion.
tags: gov2.0, open source, search
Sun, Nov 8, 2009
Unlikely Group Working Happily Together To Solve Patent Problem
by Carl Malamud | @CarlMalamud | comments: 4
People following the issue of open sourcing the U.S. Patent Database might have been surprised to read an announcement in the official business opportunities web site of the U.S. Government: Synopsis for Public Data Dissemination Sole Source Contract to Google, Inc.
While the first reaction of many might be "OMG, WTF, how could they," this is actually good news, with an unlikely cast of characters working together including Google, Intellectual Ventures, and the Internet Archive.
In September, the Patent Office announced a rather strange "Request for Information" (RFI). Under this proposed scheme, the Patent Office would receive a substantial (upwards of $10 million!) donation of equipment from a vendor. In return, the vendor would get to be the official distributor of the patent database to the public, and would get to sell "value-added products." Among other things, the vendor would get access to the patents before the public does, allowing them to mine the database, and would be allowed to sell a variety of bulk products.
While the RFI makes a nod to public access, like all these Zero-Dollar deals the government cuts, there would be a lot of limits on what is "public" data as the vendor tries to recoup their investment by selling the so-called "value-added" products. Readers may remember a similar fiasco with the Government Accountability Office where the Federal Legislative Histories were given away to Thomson West and now even the U.S. Congress has to pay to access this material.
The patent database is no ordinary database. This is the only database specifically called out in the U.S. Constitution as being the responsibility of the U.S. Executive Branch to run! A lot of people think this Zero-Dollar deal the Patent Office is contemplating kind of stinks, and I'm really pleased to announce that a broad coalition has come together to make this data more broadly available immediately:
- Intellectual Ventures, the IP group founded by Nathan Myhrvold, is donating several terabytes of the back file to Public.Resource.Org, the Internet Archive, and a variety of other groups to make available to everybody.
- Google asked for permission to crawl the public application system (known as "PAIR"). The announcement by the Patent Office of a "sole source contract to Google" was the government's way of saying we have permission to crawl their system and bypass the CAPTCHAs. This is good news, because the PAIR system contains the "binders," which is all the material that supplements the basic applications and grants.
- The Internet Archive has set aside a boatload of disk drives to serve this data. In addition, Public.Resource.Org will provide the usual rsync and FTP, and we expect a variety of other groups to provide mirrors both for bulk access and end-user systems.
It goes without saying that Google, the Internet Archive, and Intellectual Ventures are 3 groups that don't often work together, and I think this illustrates the compelling public interest in making the patent database more broadly available. We announced this Section 8 Task Force in a letter to Congressman Mike Honda. And, we also sent in a FOIA request to the Patent Office, putting them on notice that we expect any responses to their RFI $0 boondoggle to be made available to the public, as required by law.
In the long term, the Patent Office just needs to fix their system instead of resorting to silly $0 deals. They have 600 staff in Information Technology and spend hundreds of millions of dollars. Surely, they can find a way to serve the public as part of that? Putting a lien on the Patent database in return for $10 million in hardware instead of fixing their 70's-era mainframes just doesn't make sense.
In the meantime, we should have the first 8 terabytes of data up pretty soon. Those interested in learning more about the issue are urged to consult the paper trail on our PTO page which includes letters to and from Congress, and pointers to the Patent Office procurement docs.
tags: gov2.0, open data, open source
Thu, Oct 15, 2009
Law.Gov: America's Operating System, Open Source
by Carl Malamud | @CarlMalamud | comments: 13
Public.Resource.Org is very pleased to announce that we're going to be working with a distinguished group of colleagues from across the country to create a solid business plan, technical specs, and enabling legislation for the federal government to create Law.Gov. We envision Law.Gov as a distributed, open source, authenticated registry and repository of all primary legal materials in the United States. More details on the effort are available on our Law.Gov page.
The process we're going through to create the case for Law.Gov is a series of workshops hosted by our co-conveners. At the end of the process, we're submitting a report to policy makers in Washington. The process will be an open one, so that in addition to the main report which I'll be authoring, anybody who wishes to submit their own materials may do so. There is no one answer as to how the raw materials of our democracy should be provided on the Internet, but we're hopeful we're going to be able to bring together a group from both the legal and the open source worlds to help crack this nut.
The idea for Law.Gov seems to be getting a good reception in Washington, D.C. Senator Lieberman, writing on behalf of the Senate Committee on Homeland Security and Governmental Affairs, the committee responsible for the E-Government Act, has already accepted our request to submit our report to the Committee. Additional formal requests to submit the completed report are outstanding.
Law.Gov is a big challenge for the legal world, and some of the best thinkers in that world have joined us as co-conveners. But, this is also a challenge for the open source world. We'd like to submit such a convincing set of technical specs that there is no doubt in anybody's mind that it is possible to do this. There are some technical challenges and missing pieces as well, such as the pressing need for an open source redaction toolkit to sit on top of OCR packages such as Tesseract. There are challenges for librarians as well, such as compiling a full listing of all materials that should be in the repository.
Law.Gov is an outgrowth of 3 years of work we've done at Public.Resource.Org along with our numerous colleagues in the open law movement across the country. There have been a series of piecemeal successes which have demonstrated that there is a demand and a need for more legal information to be more broadly available. I'm hopeful now that a truly national movement may have coalesced and that there is at least a chance we can bring this across the finish line and create a new function inside of government, the publication of America's operating system on an open source platform.
The factor that made this coalesce was the recent Government 2.0 Summit put on by Tim O'Reilly. I gave a talk at that summit about the need to put primary legal materials on-line, and it was gratifying to hear the Deputy CTO of the United States, in his closing keynote, highlight that as one of the issues which he thought the White House should help make real through their "moral authority and convening power." The Government 2.0 Summit was also an example of convening power, and I was very pleased that it was more than yet another conference about open government; it was a forum that brought together people interested in creating real change. Tim O'Reilly, as the Convener-in-Chief, should be congratulated, and I'm hoping that future Summits lead to even more concrete results.
tags: gov2.0, open government
Sat, Oct 10, 2009
Larry Lessig and Naked Transparency
by Carl Malamud | @CarlMalamud | comments: 5
Larry Lessig had a dream. In this dream, he was standing on K Street, preaching in the dark. Suddenly, a naked posse on Segways went whizzing by, shining their flashlights in people's faces. Bystanders were all blinded by these random lights and lost their night vision. When Larry turned around, the naked posse was racing towards the White House for an open government rally, trailed by a screaming mob of marijuana-smoking birthers.
Larry Lessig wrote up his dream in a cover article for the New Republic entitled “Against Transparency: The perils of openness in government.” I suspect that this article will cause some angst inside the Beltway, where you're either with us or against us. But, before the posse turns into a lynch mob, it is important to give the article a careful read.
tags: gov2.0, open government, transparency
Tue, Oct 6, 2009
Questions (and Answers!) About the Federal Register
by Carl Malamud | @CarlMalamud | comments: 3
When the White House retweets Cory Doctorow, you know something unusual has happened. As many of you saw, the Office of the Federal Register announced that source code for the Federal Register is now available in bulk—for free—and has been converted to XML. Ed Felten's shop at Princeton created a site called fedthread.org to see what you can do with the data and Public.Resource.Org helped the Government Printing Office in testing early stages of the XML work.
All-in-all, a nice piece of public-private cooperation and an important step towards open sourcing America's operating system, and I figured that was the end of that. So, imagine my surprise when I got a call from the White House saying they were making Raymond Mosley, Director of the Office of the Federal Register (OFR), and Michael L. Wash, the Chief Information Officer of the Government Printing Office (GPO), available just in case there were any technical questions from the net.
I gathered questions from a variety of sources, including on-line discussion groups and Twitter, and have been doing email back and forth with both Ray and Mike. Hope this is useful (it certainly has been fun to do)!
tags: gov20, open government, open source
Thu, Aug 27, 2009
PACER Petition
by Carl Malamud | @CarlMalamud | comments: 0
Law librarians from Georgetown and Stanford Law Schools are getting ready to deliver a petition from several hundred law libraries to the Administrative Office of the Courts, the group that administers the federal judiciary's PACER system. They have a goal of 1,000 signatures. If you have a few minutes, look over the petition and if you agree with it, I'm sure the organizers will appreciate your support. The petition asks for some pretty reasonable things from the federal judiciary: signatures on documents, copies of the dockets to federal libraries, and a better way to disseminate the data.
Click Here to Sign the PACER Petition.
Appointments have already been set up for just after Labor Day for the petition to be delivered in person to the Administrative Office of the Courts, with briefings about the content to the Government Printing Office, House, and Senate staff. Your support will help add important weight to this message.
tags: gov2.0
Sun, Jul 19, 2009
A Crowd-Sourced National Communications Census
by Carl Malamud | @CarlMalamud | comments: 9
My last tour of duty in DC was Chief Technology Officer at the Center for American Progress. One of the fun things I got to do was figure out what everybody else did, including my fellow Senior Fellows, the folks that generated most of the policy work, many of whom are now occupying senior posts in the new administration.
One of the most fascinating was Mark Lloyd. An experienced Emmy-winning television producer, communications lawyer, and community activist, Mark is the author of a well-regarded book about communications and democracy and numerous columns. He's currently at the Leadership Conference on Civil Rights.
The project Mark Lloyd was working on was a National Broadband Map to show our true communications capabilities. And, he wanted to crowd-source the map from community groups, supplementing that with census and other data from several different places to create a big mash-up. This was in 2005, around the same time Adrian Holovaty was thinking about chicagocrime.org.
I think the time is now ripe for this project, and when the new folks at the FCC asked me what I thought they should look at I pitched Mark's idea (they're reaching out to lots and lots of folks, which is a great sign). I asked for posting privileges here at Radar so I could pitch the idea to the Internet as well since I'm taking your name in vain as the folks that would make this happen.
tags: broadband, fcc, geo, iphone