Google Web Accelerator considered overzealous

Some users of 37 Signals’s new Backpack web application started noticing yesterday that their backpacks had been rifled through and a page here and there had simply disappeared. A little digging found Google’s new Web Accelerator to be the culprit. Writes Jason Fried:

The accelerator scours a page and prefetches the content behind each link. This gives the illusion of pages loading faster (since they’ve already been pre-loaded behind the scenes). Here’s the problem: Google is essentially clicking every link on the page — including links like “delete this” or “cancel that.” And to make matters worse, Google ignores the Javascript confirmations. So, if you have a “Are you sure you want to delete this?” Javascript confirmation behind that “delete” link, Google ignores it and performs the action anyway.

Since the same could be said of just about any web spider or bot — including the Googlebot every new site owner wants so desperately to visit — why then is Web Accelerator any different? Spiders all meander about in much the same manner, traversing the Web by visiting pages and clicking links recursively, link to page, link to page, ad nauseam.
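The crawl Fried describes can be sketched in a few lines of Python (the page markup, page names, and URLs here are made up for illustration): a prefetcher simply harvests every href on a page and fetches each one, with no notion of which links are destructive.

```python
from html.parser import HTMLParser

class LinkHarvester(HTMLParser):
    """Collects every href on a page, the way a naive prefetcher would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A hypothetical slice of a logged-in user's page:
page = '''
<a href="/pages/1">My notes</a>
<a href="/pages/1/delete"
   onclick="return confirm('Are you sure you want to delete this?');">delete this</a>
'''

harvester = LinkHarvester()
harvester.feed(page)
# A prefetcher working from this list would GET /pages/1/delete.
# The onclick confirm() never runs, because no click ever happens.
print(harvester.links)  # ['/pages/1', '/pages/1/delete']
```

Note that the JavaScript confirmation is invisible at this level: the harvester sees only markup, so the “delete this” link is just another URL to fetch.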

The difference lies in just who is doing the clicking.

Web-dwelling spiders are typically locked out of a web application’s personal view: the view with all those administrative links like “Delete Entry,” “Add Item,” and “Drop Table.” The Google Web Accelerator, on the other hand, sees the web as you do, administrative warts and all. As an example, take a gander at a sample phpMyAdmin view (a web application for managing your MySQL database) and notice all those red Xs. If you clicked, for instance, “Drop,” you’d be dropping the entire database table at hand. But not without a popup JavaScript confirmation (“Do you really want to DROP TABLE…”), which would send most anyone who hadn’t meant to click “Drop” into shock, followed by a quick click of the “Cancel” button. The Web Accelerator summarily ignores this warning (actually, it most likely doesn’t even notice it, nor could it reliably be taught to understand such confirmations in any automated fashion). And this spider is doing all this clicking preemptively, prefetching anything within your purview you might actually chance to click on in the near future.

While there’s much hay to be made about the inapplicability of simple GET links (i.e. your run-of-the-mill hyperlink) to actions that result in change (i.e. deleting rather than simply visiting something), it is well known that web applications in the wild often don’t follow those safety standards. phpMyAdmin, mentioned in the previous paragraph, is rife with potentially destructive GETs. Even Google’s own Blogger weblog application has its share of destructive GET actions; comments Anil Kandangath: “To see a dangerous use, you have to look no farther than Google’s own Blogger. If you post a comment on a blogger weblog, and if you are logged on, you can see a delete icon near your comments. If you are the owner of the weblog, you can see the delete icon near *all* the comments.” I wonder if Blogger users have noticed their comments disappearing? (And if not, why not?) A quick look through Gmail, by contrast, finds pretty much everything potentially inbox-altering sitting behind a nice safe POST.
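The distinction at issue, sketched in HTML (the URLs and labels here are hypothetical): a side effect behind a plain GET link is reachable by any link-follower, while the same action behind a POST form is not, since prefetchers and spiders only issue GETs.

```html
<!-- Unsafe: a destructive action behind a plain GET link. The confirm()
     runs only on a real click; a prefetcher just fetches the href. -->
<a href="/comments/42/delete"
   onclick="return confirm('Delete this comment?');">Delete</a>

<!-- Safer: the same action behind a POST form. A link-following
     prefetcher never triggers this, because it requires a POST. -->
<form method="post" action="/comments/42/delete">
  <input type="submit" value="Delete">
</form>
```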

Yes, one could argue that only “badly designed” web applications that don’t follow the rules of GET and POST will be affected, but I’m not sure this is an argument that Google (or anyone else who actually builds or uses web apps in the wild) would care to make in this situation.

(Not to mention the pure annoyance value of your Web Accelerator clicking all those “Logout”/”Sign Out” links on the sites you visit — and those certainly are not usually seen as POST-worthy.)

There are some rules with regard to prefetching, but as 37s SVN reader “matthew” comments: ‘It appears that google is going past the standards of “prefetching”, at least as described by mozilla. that faq makes a point that “URLs with a query string are not prefetched” and “https:// URLs are never prefetched for security reasons”’. (That said, Ruby on Rails apps like Backpack format what would usually be query strings appended to a URL with a ? as fully formed URLs, like http://username.backpackit.com/pages/blank_slate.) Now a webmaster FAQ on the Google Web Accelerator site (and when I say “on the site” I mean that it was on the site and seems, at the time of this writing, to have disappeared) does suggest how you might hint to the Web Accelerator about what _should_ be prefetched (applying a rel=”prefetch” attribute to those links), but not how to say that a link specifically should _not_ be prefetched. If you’re running an Apache server, Shane Allen suggests some rewrite magic based on the HTTP_X_MOZ header; you should find something apropos for your particular server and application combination.
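For the Apache case, the usual recipe (a sketch only; test it against your own server and version) keys off the same header, which mod_rewrite exposes as %{HTTP:X-moz}, and refuses any request flagged as a prefetch:

```apache
# Refuse any request that announces itself as a prefetch
# via the X-moz request header (seen as HTTP_X_MOZ in CGI).
RewriteEngine On
RewriteCond %{HTTP:X-moz} ^prefetch$ [NC]
RewriteRule .* - [F,L]
```

The [F] flag returns a 403 Forbidden, so the prefetcher gets an error page rather than triggering the action; an ordinary click, which carries no such header, goes through untouched.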

Until this is all sorted out, if you build or host web applications, you might want to take some of the precautions being bandied about today. Here’s one for Ruby on Rails apps. And there are various others embedded in the discussion following the 37 Signals blog post on the issue.

In the meantime you can keep up with the conversation on the 37 Signals blog and Web Accelerator Google Group.