In an earlier post, I said that a key to government opening its data to citizens, being more transparent, and improving the relationship between citizens and government in our web 2.0 world is ensuring that content on government sites can be easily found in search engines. Architecting sites to be search engine friendly, particularly sites with as much content and legacy code as those the government manages, can be a resource-intensive process that takes careful long-term planning. But two keys are:
- Assessing who the audience is and what they’re searching for
- Ensuring the site architecture is easily crawlable
Crawlability Quick Wins
This post is about quick wins in crawlability. In many cases, ensuring crawlability also ensures accessibility (particularly access via screen readers). From this standpoint, many government web sites have an advantage over other sites since they already build in many accessibility features. Creating search-friendly sites also improves usability and user access from mobile devices and slow connections. So forget everything you may have heard about how you have to sacrifice user experience for SEO. SEO done right facilitates deeper audience engagement, makes it easier for visitors to navigate and find information on the site, and provides access to a wider variety of users.
Use XML Sitemaps
Create XML Sitemaps that list all the pages on the site and submit them to the major search engines.
Why is this important? Many government sites have poor information architecture. Ideally each page of the site should have at least one link to it. This helps users navigate the site and helps search engines find all of the pages. Long term, these sites should revamp their navigational structure so that at least one link exists to every page. Since that may take some time to implement, an XML Sitemap can function in the meantime to provide a list of all pages for search engines to crawl.
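A minimal Sitemap, following the sitemaps.org protocol, is just an XML list of URLs. The URLs and dates below are placeholders; only the loc element is required for each entry:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.gov/</loc>
    <lastmod>2009-06-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.example.gov/health/topics.html</loc>
  </url>
</urlset>
```

Beyond submitting the file directly to each engine, you can also point all crawlers to it with a single `Sitemap: http://www.example.gov/sitemap.xml` line in robots.txt.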
Government sites have already made great progress in search by using XML Sitemaps.
The Energy Department’s Office of Scientific and Technical Information (OSTI) implemented the XML Sitemaps protocol with great success. “The first day that Yahoo offered up our material for search, our traffic increased so much that we could not keep up with it,” said Walt Warnick, OSTI’s director.
If possible, provide an HTML sitemap as well, which gives site visitors a browsable view of the site’s structure. Below is a good example of a browsable HTML sitemap on nih.gov:
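An HTML sitemap can be as simple as a categorized list of plain links that both visitors and crawlers can follow. The sections and URLs in this sketch are hypothetical:

```html
<h1>Site Map</h1>
<ul>
  <li><a href="/health/">Health Information</a>
    <ul>
      <li><a href="/health/topics.html">Health Topics A&#8211;Z</a></li>
      <li><a href="/health/clinical-trials.html">Clinical Trials</a></li>
    </ul>
  </li>
  <li><a href="/grants/">Grants &amp; Funding</a></li>
  <li><a href="/about/">About the Agency</a></li>
</ul>
```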
Don’t block access to content
Make all content available outside of a login, registration form, or other input mechanism. Search engine crawlers can’t access content behind a login or registration. If the content requires the visitor to enter an email address or otherwise provide input before accessing it, it won’t show up in search results.
Avoid dead ends when moving content
When content moves, change the links within the site to point to the new location, and implement a 301 redirect from the old page to the new one. A 301 is a server response code that tells browsers (and search engines) that a page has moved permanently. Some servers send 302 codes by default instead. A 302 indicates that the move is temporary, so search engines tend not to index the destination pages. For instance, the Myelodysplastic Syndromes Treatment page on the NIH site isn’t indexed by Google, and the NIH website doesn’t appear at all on the first page of Google results for a search for [myelodysplastic syndromes].
This could be in part because links to this page are actually to http://health.nih.gov/viewPublication.asp?disease_id=85&publication_id=869&pdf=no, which then executes a 302 redirect to the destination page. This is likely just how the site’s content management system works: a database query triggers the actual page that should appear.
This architecture could be made significantly more search engine friendly simply by changing the 302 to a 301.
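If the site runs on Apache, for example, the change can be a single directive. The paths below are hypothetical; IIS and most content management systems offer equivalent settings:

```apache
# Send a permanent (301) redirect instead of the default temporary (302) one.
# Requires mod_alias; goes in the server config or an .htaccess file.
Redirect permanent /old-treatment-page.html http://health.example.gov/treatment-page.html
```

The key is that the server answers requests for the old URL with a 301 status, so search engines transfer the old page’s indexing to the new URL rather than treating the move as temporary.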
Use descriptive ALT text for images
Of course, using the ALT attribute for images is useful for more than search engine purposes. It also improves accessibility overall and makes the site easier to use from screen readers and over slow connections. Government web sites generally do a fairly good job with this. But my earlier post on whitehouse.gov describes how full swaths of text are hidden by images. You can see that the Texas tourism site loses its navigation entirely with images turned off:
This has resulted in the pages linked from the navigation not being indexed (because search engines can’t access the links to follow them):
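A navigation image needs only descriptive ALT text, inside a plain link, to stay usable with images off and crawlable by search engines. The file name and label in this sketch are hypothetical:

```html
<!-- The ALT text stands in for the image for screen readers and crawlers,
     and the plain href gives search engines a link they can follow. -->
<a href="/cities/"><img src="nav-cities.gif" alt="Texas Cities and Regions"></a>
```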
Ensure all links are working and that the server is responsive
This, of course, is good practice for usability as well as for search engine optimization. You can find reports for both broken links and server issues in Google Webmaster Tools and Microsoft Live Search Webmaster Tools.
Ensure each page has a unique title and meta description that accurately describe the page
Again, looking at the Texas tourism site, you can see how lack of these elements creates a poor user experience in the search results:
You can also see this with the HSTAT site. The title tag (and heading) of this page is “Key Steps”. Someone viewing that page in search results would have no way of knowing what that page is about. Better would be something like “Preventing Pressure Ulcers: Key Steps | National Library of Medicine”. This both describes the content and indicates the authority of the material.
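Using the suggested title above, the fix is two lines in the page’s head. The description text here is illustrative, not the page’s actual copy:

```html
<head>
  <title>Preventing Pressure Ulcers: Key Steps | National Library of Medicine</title>
  <meta name="description"
        content="Key steps clinicians can take to prevent pressure ulcers in at-risk patients, from the National Library of Medicine.">
</head>
```

The title appears as the clickable headline in search results and the meta description often appears as the snippet, so together they determine whether a searcher can tell what the page is about before clicking.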
Don’t hide text in Flash or images
The nih.gov home page, for instance, uses Flash to rotate through several headlines with descriptive text (such as the Salmonella description shown below) that is invisible to search engines.
When you search for the exact text string from the Flash image, the nih.gov home page isn’t returned. The reason becomes clear when you view the page in Google’s cache, which shows the page exactly as Google crawled it.
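One common remedy is progressive enhancement: serve the headlines as plain HTML and let script swap in the Flash rotator, so crawlers and visitors without Flash still get the text. This sketch assumes the SWFObject library; the headline and file names are hypothetical:

```html
<div id="headlines">
  <!-- Real text that crawlers and screen readers can read -->
  <h2><a href="/news/salmonella.html">Salmonella Outbreak: What You Should Know</a></h2>
</div>
<script type="text/javascript" src="swfobject.js"></script>
<script type="text/javascript">
  // Replace the text block with the Flash rotator only when
  // Flash 9+ is available; otherwise the HTML above remains.
  swfobject.embedSWF("headlines.swf", "headlines", "400", "200", "9.0.0");
</script>
```

Because the HTML text is what sits in the page source, it is also what search engines index, while most visitors see the Flash version.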
Understand the fundamentals of search engine friendly web architecture
Lots of resources exist, including my sites Jane and Robot and Nine By Blue, and the search engines also provide much of this information for free (see, for instance, google.com/webmasters). A quick peek at the medicare.gov robots.txt file, for instance, shows that they’ve implemented directives that the search engines don’t recognize:
#wait 30 seconds before starting a new URL request default=30
#index this site between 1AM – 5AM EST
#limit concurrent active URLs to 2 for each index server
If they’re worried about the search engines crawling them too much for their servers to handle, they should use the crawl-delay directive and set the crawl rate to slower in Google Webmaster Tools.
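Crawl-delay is a non-standard but widely honored directive that Yahoo’s and Microsoft’s crawlers recognize (Google ignores it, which is why its crawl rate is adjusted in Webmaster Tools instead). A sketch of what a robots.txt entry could look like, with the delay value chosen to match the intent of the comments above:

```
User-agent: Slurp
Crawl-delay: 30

User-agent: msnbot
Crawl-delay: 30
```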
This post provides just a few examples of how a more crawlable site architecture can make a big difference in how findable content is. Government sites face many of the same issues that large corporations do: many sites managed by different departments, no clear process for cross-department collaboration, and a lack of global standards. But government and companies alike can evaluate their sites for crawlability as a step toward a more findable web.