I’m on the board of CommonCrawl.Org, a nonprofit corporation that is attempting to provide a web crawl for use by all. An interesting report just got sent to us about the use of robots.txt files (the Robots Exclusion Standard) within the .gov top-level domain.
In examining about 32,000 subdomains in .gov, it turns out at least 1,188 of these have a robots.txt file with a “global disallow,” meaning robots are excluded from indexing this content. Even more curious, on 175 of these sites, while there is a global disallow, there is a specific bypass that allows the Googlebot to index the data. You can look at the raw data on Factual.
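The pattern in question looks something like this. This is a hypothetical sketch, not taken from any particular .gov site: an empty Disallow line in the Googlebot record grants that agent full access, while the wildcard record shuts everyone else out.

```
# Hypothetical robots.txt illustrating the "global disallow
# with a Googlebot bypass" pattern found on 175 .gov sites.

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```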
At Public.Resource.Org, we’ve always felt that a robots.txt file on a government site should be used only for purposes of security and integrity of the site, not because some webmaster arbitrarily decides they don’t want to be indexed. Indeed, on several occasions we have deliberately ignored government-imposed robots.txt files because we felt this was an arbitrary and illegal attempt to keep the public out.
And, needless to say, it doesn’t make any sense at all to let in some web crawlers and not let in others. If this is a reaction to a security/integrity issue, such as limited capacity, the proper thing to do is include in the robots.txt file a comment that can be used by other bots to explain what is going on. For example, it could be perfectly reasonable for a government group faced with limited capacity to ask a robot to limit crawls to a certain number of queries per second and only whitelist crawlers that agree to that condition.
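A non-discriminatory, capacity-driven policy might look something like the following sketch. Note that Crawl-delay is a widely recognized but nonstandard extension honored by some crawlers, and the contact address here is invented for illustration.

```
# Hypothetical robots.txt for a capacity-limited government site.
# Any crawler willing to respect the delay below is welcome.
# If your bot needs a different arrangement, contact the
# webmaster address on our site (address here is illustrative).

User-agent: *
Crawl-delay: 10
Disallow:
```

The comment lines explain the policy to any bot operator who fetches the file, and the single wildcard record applies the same terms to every crawler rather than naming favorites.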
Government webmasters should use the robots.txt file sparingly, and should do so in a non-discriminatory fashion.