Robots.Txt and the .Gov TLD

I’m on the board of CommonCrawl.Org, a nonprofit corporation working to provide a web crawl for use by all. An interesting report was just sent to us about the use of robots.txt files, the mechanism defined by the Robots Exclusion Standard, within the .gov top-level domain.

In an examination of about 32,000 subdomains in .gov, it turns out that at least 1,188 of them have a robots.txt file with a “global disallow,” meaning all robots are excluded from indexing the content. Even more curious, 175 of these sites pair the global disallow with a specific exception that allows the Googlebot to index the data. You can look at the raw data on Factual.
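
For readers who want to reproduce the basic test, here is a minimal sketch, assuming Python’s standard-library robotparser; the hostname is a placeholder rather than one of the sites in the report, and this is not necessarily the methodology the report itself used.

```python
# A minimal sketch: check whether a robots.txt blocks everyone from the site
# root while still letting Googlebot in. The hostname below is a placeholder,
# not one of the sites in the report.
from urllib import robotparser

def global_disallow_with_google_bypass(robots_url):
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                                             # fetch and parse robots.txt
    blocked_for_everyone = not rp.can_fetch("*", "/")     # "global disallow"
    allowed_for_google = rp.can_fetch("Googlebot", "/")   # Googlebot bypass
    return blocked_for_everyone and allowed_for_google

if __name__ == "__main__":
    print(global_disallow_with_google_bypass("https://www.example.gov/robots.txt"))
```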

At Public.Resource.Org, we’ve always felt that a robots.txt file on a government site should be used only for purposes of the security and integrity of the site, not because some webmaster arbitrarily decides they don’t want to be indexed. Indeed, on several occasions we have deliberately ignored government imposed robots.txt files because we felt this was an arbitrary and illegal attempt to keep the public out.

And, needless to say, it doesn’t make any sense at all to let in some webcrawlers and not let in others. If the restriction is a response to a security or integrity issue, such as limited capacity, the proper thing to do is to say so in the robots.txt file, in a comment that other bots can read, so everyone knows what is going on. For example, it would be perfectly reasonable for a government group faced with limited capacity to ask a robot to limit crawls to a certain number of queries per second and to whitelist only the crawlers that agree to that condition.
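
To make that concrete, here is a minimal sketch of what a crawler that agrees to such a condition might do, assuming Python’s standard library; the hostname, bot name, paths, and fallback delay are hypothetical.

```python
# A minimal sketch, not any agency's actual policy: a well-behaved crawler
# that reads the advertised crawl delay from robots.txt and throttles itself.
# The hostname, bot name, and paths below are hypothetical placeholders.
import time
from urllib import robotparser
from urllib.request import urlopen

BOT_NAME = "ExampleBot"
SITE = "https://www.example.gov"

rp = robotparser.RobotFileParser(SITE + "/robots.txt")
rp.read()                                   # fetch and parse the robots.txt

delay = rp.crawl_delay(BOT_NAME) or 1.0     # fall back to a polite 1 request/second

for path in ["/data/report1.html", "/data/report2.html"]:
    if rp.can_fetch(BOT_NAME, path):        # honor the exclusion rules
        with urlopen(SITE + path) as resp:  # retrieve the page
            resp.read()
        time.sleep(delay)                   # honor the advertised capacity limit
```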

Government webmasters should use the robots.txt file sparingly, and should do so in a non-discriminatory fashion.

  • Aaron

    While I can agree that government-operated websites on the public Internet should not arbitrarily disallow all (or a select group of) robots from indexing their sites, I also think it is just plain wrong to ignore a site’s robots.txt simply because you have decided it’s okay to do so.

    You should respect it and find a way to work with those agencies to change it. Simply declaring their decision arbitrary and unfair is not an appropriate basis for ignoring it. What if everyone just “decides” to ignore the disallows whenever they feel like it?

  • Scott

    So a government agency can use our tax dollars to create content for the public good, and only allow Google to crawl and index that content?

    In an ideal world, yes, it makes sense to ask everyone to change their defective and misconfigured robots.txt files.

    Obviously, this isn’t an ideal world.

    Sometimes practical and policy considerations need to take precedence, given the huge number of agencies that are clueless, under-resourced, and, if you submit a request today, might get around to changing their robots.txt files sometime in the spring of 2015, if ever.

  • Erik Hetzner

    One of the problems with robots.txt files is that they embody policy decisions about the dissemination and re-use of information, yet they are typically put in place arbitrarily by webmasters who barely give the file a moment’s thought. Another problem is that they are used as a “security” measure, which they certainly are not: attackers do not honor robots.txt!

    Aaron: should government webmasters be allowed to dictate to organizations or citizens whether or not their content is used? Organizations that wish to crawl the web should work with site owners to ensure that sites are not adversely affected by crawling, but that does not mean we should be barred by robots.txt from crawling otherwise publicly available content because of arbitrary decisions by webmasters.

    You might be interested in the report of the Section 108 study group convened by the Library of Congress and the U.S. Copyright Office, particularly IV.A.2.f.iv.(e)(2) (yikes!), namely:

    (2) Exception to opt-out for political and government sites

    In order to ensure that online works of government and political organizations can be preserved, the Study Group recommends that those organizations not be allowed to opt out of capture and preservation of their websites. Libraries and archives should be permitted to capture publicly available online content from sources such as the following regardless of whether the rights holder(s) opt out:

      • Federal, state, and local government entities;
      • Political parties;
      • Campaigns for elected office; and
      • Political action committees (as defined in relevant law).

    I know that there are some other organizations that have made the decision to ignore robots.txt files on government sites, relying on the recommendation of the Section 108 study group.

  • http://szabgab.com/ Gabor Szabo

    Using robots.txt for security? Will intruders stay out because you ask nicely? If I were looking for “interesting” information, I would pay special attention to the areas that are excluded by the robots.txt.

  • John

    Is robots.txt really a ‘recognized’ standard? http://blog.sherifmansour.com/?p=16

  • http://strivinglife.com/ James

    “Indeed, on several occasions we have deliberately ignored government imposed robots.txt files because we felt this was an arbitrary and illegal attempt to keep the public out.”

    What a poor decision. Arbitrary or not, they set up the robots.txt and made certain decisions in that file. In my opinion, ignoring it discredits your crawler, and, like other crawlers that disobey or ignore the robots file, it should be blocked completely.

    “And, needless to say, it doesn’t make any sense at all to let in some webcrawlers and not let in others.”

    False. Googlebot (for the sake of example) operates at a much higher level than one of the hundreds of miscellaneous crawlers out there. If I don’t want Slurp crawling my site (assuming it decides to follow robots.txt) because it grabs content too fast despite the limits in said file (following their specs), then so be it.

    If robots.txt is ignored by crawlers, but is our only method to ‘talk to’ them, then let’s just drop robots.txt entirely.

    What’s the real issue here?

  • http://www.timeatlas.com Anne

    I also wonder if any of these global-disallow robots.txt files are the result of a migration. For example, content is pushed from a development server whose global-disallow robots.txt file overwrites the production version. I’ve seen it happen on non-.gov sites with some frequency. No one notices until content drops from the search engines.

  • http://www.soliantconsulting.com/ Thomas Andrews

    Robots.txt is not a security measure.

    Robots.txt is meant to be a guide for well-behaved robots. It is not, and cannot be, enforced by the site.

    In fact, if you use robots.txt as a security measure, the file itself maps out to hackers precisely the areas of your site you are trying to protect.
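
    To make the point concrete, here is a minimal sketch (the hostname is a hypothetical placeholder) of how trivially anyone, well-behaved or not, can list exactly the paths a robots.txt is trying to hide:

    ```python
    # Fetch a robots.txt and print every path it asks crawlers to avoid.
    # An attacker can run the same few lines, which is why this is not security.
    from urllib.request import urlopen

    with urlopen("https://www.example.gov/robots.txt") as resp:
        for line in resp.read().decode("utf-8", "replace").splitlines():
            if line.strip().lower().startswith("disallow:"):
                print(line.split(":", 1)[1].strip())
    ```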

  • Ken Hansen

    If I tilt my pointy tin-foil hat just right, the name that springs forward is Eric Schmidt, President of Google and senior member of then-Senator Obama’s campaign for President of the United States…

    But, if I take off my pointy tin-foil hat, I’m left with two distinct impressions: first, that the absolute blocks and Googlebot-only robots.txt files were most likely left over from pre-production development and testing of the sites and were innocently forgotten; and second, that only a true fool thinks a deny-all robots.txt makes anything secret; it only makes it less easy to find.

  • http://www.famouscastles.net/ Ajeet

    I am amused by this finding. The “do no evil” dictum clearly does not extend to google-users :) I hope that the truth is as Ken Hansen suggests, i.e., that the absolute blocks and Googlebot-only robots.txt files are accidental.

  • Erik Hetzner

    James above is ignoring the crucial distinction between a web site run by a private entity and one run by the government. Certainly he is not arguing that it is fine for the government to favor a certain commercial entity over others?

    These mistakes may be accidental, but they have a very real effect, and commoncrawl.org is quite correct (in my personal opinion) to carefully consider the consequences and ignore robots.txt when necessary.

  • http://public.resource.org/ Carl Malamud

    Hi -

    Just wanted to be clear about this whole ignoring-robots.txt thing. It was not the commoncrawl robot that ignored a robots.txt; it was Carl Malamud of Public.Resource.Org who, after very careful consideration, manually ran a series of wget’s against some selected databases in order to provide bulk access to government data. I happen to be on the board of commoncrawl, but they are a well-behaved crawler and obey all the rules.

    When I have decided to ignore a robots.txt, it has always been after looking carefully at the web site, at the enabling legislation for the agency, and at the agency officials involved.

    And, when I do a crawl of this sort, I send email to an agency official letting them know exactly what I’m doing and giving them my cell phone number in case there are problems.

    An example of such a database is the copyright data we made a copy of two years ago.

    Carl

  • Erik Hetzner

    Sorry to have perpetuated the confusion.

    You should definitely have a look at the Section 108 report to see if you think it would be appropriate to follow their recommendations regarding robots.txt on government web sites.