Service Monitoring Dashboards are mandatory for production services!

Google App Engine went down earlier today. GAE is still a developer preview release, and currently lacks a public monitoring dashboard. Unfortunately this means that many people either found out from their app and/or admin consoles being unavailable or from Mike Arrington’s post on TechCrunch.

Google has a strong Web Operations culture, and there are numerous internal monitoring tools in use across the company, along with a smaller set available to customers. It’s suprising that Google launched a developer platform without providing something beyond an email group, although they are by no means the first to do so.

google-app-engine-needs-a-dashboard.png

Service Monitoring Dashboards are mandatory for production services and platforms!

  • If you launch a platform that people pay you money for, you need to have a real time service dashboard. Ideally this should be decoupled from the rest of your infrastructure.
  • Don’t rely on platforms that lack service monitoring dashboards for production.

Many companies are initially reluctant to provide this kind of monitoring to the public, and only do so in reaction to an outage. However, it seems that every company that offers such a dashboard uses it as a source of competitive advantage.

The best example of this is trust.salesforce.com which they launched after series of outages in 2006. Amazon (eventually) launched a status dashboard for AWS, and added RSS feeds for specific services which I think is pretty cool.

AWS Service Health Dashboard - Jun 17, 2008.png

Javier Soltero at Hyperic points out

1. The reports of service outages arrive long after anyone who depends on the services can possibly do anything to mitigate their effect.
2. The services themselves seem incapable of providing any visibility into the circumstances that might lead to future outages.

[...]Even TechCrunch points out that the Google Apps blog doesn’t even mention the outage. Other clouds rely on blogs such as this one, this one, or maybe even this one (from our good friends at Mosso). These are all places where outages can be discussed, but not the right means for people to find out whether it their application that crashed, or the cloud that it depends on.

(Updated:Niall Kennedy pointed out that GAE is still a preview release, and I agree that my original wording was wrong. My intent is to emphasize the importance of providing a public service dashboard and so I’ve edited accordingly.)

tags: , , , , , , , , , , , , ,
  • http://www.niallkennedy.com/ Niall Kennedy

    Google App Engine is currently in preview release stages. No charge, no SLA, risky for a production service. Two weeks ago they allowed more developers through the doors to stress test the system and it looks like someone tripped a BigTable issue this morning.

    This outage was the result of a bug in our datastore servers and was triggered by a particular class of queries.

    Looks like a rogue query brought down the whole BigTable structure causing 502 errors. They also pushed a client update this morning and it looks like they had many external devs testing the bulk loader (write heavy).

  • http://radar.oreilly.com/jesse/ Jesse Robbins

    Thanks for pointing that out Niall. I’ve edited and toned down my post.

  • http://www.1060.org/blogxter/publish/5 Steve Loughran

    IMO, the dashboard and feed should be in a separate domain. Because if your DNS goes down, that’s the kind of thing you need to provide news on. This is what burned SourceForge in their big datacentre outage -the dashboard disappeared too.

  • http://www.hyperic.com Javier A. Soltero

    Just as a followup, Hyperic launched http://www.cloudstatus.com at the Velocity conference to address this very need. We’ll be adding additional visibility into AWS and additional clouds over the coming months.

  • http://www.realuserwebops.com Tim T

    Most of the posts seem to look at having this service level dashboards for cloud-type services, such as Google or Amazon. Of course a great idea.

    I haven’t seen widespread adoption of that type of end-user performance dashboard for just regular services within a typical data-center.

    Thoughts?

    I am blogging about these types of conversations and observations at http://www.realuserwebops.com

    I would love to hear other people’s thoughts….not sure I get pinged when people comment on this blog.

  • http://eedious.blogspot.com friarminor

    As much as we dread to hear of outages, you’d have to be prepared to face the fact that these things will continue to happen and exploring alternatives to provide users and the public as well with monitoring tools will likely be a key factor in the adoption of cloud compute services.

    You may want to take a look at what we’ve set up at Mor.ph for our Morph Appspaces

    Best.
    alain.