Scale and complexity call for leaving it to specialists
As applications move from onpremise to SaaS, the scale of deployments increases by orders of magnitude (to “webscale”). At the same time, application development and operation become tightly integrated and continuous deployment brings the frequency of updates down from months to days or even hours.
The larger scale makes the health of SaaS applications mission-critical and even existential to its providers, while the frequent updates increase the risk of failures. Therefore, monitoring and root cause analysis also become mission critical functions, and more instrumentation is needed to ensure the application’s quality of service. At the company I co-founded, we see customers using extensive and often tailored instrumentation that generates massive amounts of data (think hundreds of thousands of data streams and billions of data points per day).
Current tools make collection and visualization easier but don't reduce work
New tools are raining down on system administrators these days, attacking the “monitoring sucks” theme that was pervasive just a year ago. The new tools–both open source and commercial–may be more flexible and lightweight than earlier ones, as well as more suited for the kaleidoscopic churn of servers in the cloud, making it easier to log events and visualize them. But I look for more: a new level of data integration. What if the monitoring tools for different components could send messages to each other and take over from the administrator the job of tracing causes for events?
The Havana release features metering and orchestration
I talked this week to Jonathan Bryce and Mark Collier of OpenStack to look at the motivations behind the enhancements in the Havana release announced today. We focused on the main event–official support for the Ceilometer metering/monitoring project and the Heat orchestration project–but covered a few small bullet items as well.
Velocity 2013 Speaker Series
Building a high quality alerting system often feels like a dark art. Often it is hard to set the proper thresholds and it is even harder to define when an alert should be triggered or not. This results in alerts being raised too early or too late and your colleagues losing faith in the system. Once you use a structured approach to build an alerting system you will find it much easier and the alerts more predictable and precise.
First you have to select proper measures to alert on. This selection is key as all other steps depend on using meaningful measures. While there seems to be an infinite number of different measures, you can categorize them into three main categories:
- Saturation measures indicate how much of a resource is used. Examples are CPU usage or resource pool consumption.
- Occurrence measures indicate whether a condition was met or not. A good example is errors. These measures are often presented as a rate like failed transactions per seconds.
- Continuous measures do not have a single value at any given point in time, but instead a large number of different values. A typical example is response times. Irrespective of how small you make the sample, you will always have a large amount of values and never just one single representative value.
In preparation for the upcoming Web 2.0 Summit I am posting a few conversations with attendees that embody the Web Squared Theme. Path Intelligence uses sensor technology to understand shopping behavior in retail spaces by detecting and tracking the RF signals from mobile phones. As Sharon Biggar, co-founder, succinctly puts it – “we are like Google Analytics for the real world” giving offline retailers the same visibility on shopping behavior that online retail has enjoyed for years.
We’re quite addicted to data pr0n here at Flickr. We’ve got graphs for pretty much everything, and add graphs all of the time. -John Allspaw, Operations Engineering Manager at Flickr & author of The Art of Capacity Planning One of the most interesting parts of running a large website is watching the effects of unrelated events affecting user traffic…
Javier Soltero just launched CloudStatus during his Hyperic sponsor session today at Velocity. CloudStatus is a public health dashboard for web services like Amazon's EC2/S3, and Google's App Engine. Javier called to tell me about this last week after I declared that "Service Monitoring Dashboards are mandatory". This comes right after Amazon and Google had visible outages, and couldn't have…
Google App Engine went down earlier today. GAE is still a developer preview release, and currently lacks a public monitoring dashboard. Unfortunately this means that many people either found out from their app and/or admin consoles being unavailable or from Mike Arrington's post on TechCrunch. Google has a strong Web Operations culture, and there are numerous internal monitoring tools in…