The new stage of system monitoring is better integrated

Current tools make collection and visualization easier but don't reduce work

New tools are raining down on system administrators these days, attacking the “monitoring sucks” theme that was pervasive just a year ago. The new tools–both open source and commercial–may be more flexible and lightweight than earlier ones, as well as more suited for the kaleidoscopic churn of servers in the cloud, making it easier to log events and visualize them. But I look for more: a new level of data integration. What if the monitoring tools for different components could send messages to each other and take over from the administrator the job of tracing causes for events?

Currently, each piece of information about system resources, network activity, and applications is isolated. Typically, a system administrator has to respond to a problem with a workflow like this:

  1. Find the activity that triggered an alert–for instance, a failed application.
  2. Run through some sanity checks: make sure the host where the activity was running is running, look at the application’s log files to see what preceded the failure, etc.
  3. Look for related problems, such as whether the CPU was overloaded on the host or whether the application lacked access to a critical directory. Repeat this step until you are satisfied you have found the cause of the problem.

How much effort would it take to link these tasks together so you can move quickly from one level of forensics to another? I don’t think we’d need fancy AI or machine learning, although those are on the way. Sure, it would be nice someday to bark into one’s tablet “Show me what caused nginx to fail on host Rupert” and have a set of charts come up, but we don’t have to wait for that.

System administrators love visualization tools, with the open source Graphite being their main squeeze at the moment, along with a number of commercial solutions. A graphical interface that lets you jump to sensitive issues is also valuable.

For instance, Boundary.com lets you browse through events, click on one for details, and then drill down further. Boundary.com focuses on application traffic. Among its uses, according to Director of Technical Services Dustin Lawler, are checking the capacity of services such as Hadoop for streaming data, and checking how new services or changes, such as continuous integration, affect service. Boundary.com also tracks events, which it correlates with application traffic behavior to understand causes and impacts. This is a big data solution, and Boundary.com uses some modern data stores to handle the volume. Splunk offers a Hadoop service for similar reasons.

To improve the seamlessness of monitoring even further, I think we need advances in two key areas: data collection and linkages.

Data collection

To forge an integrated monitoring tool, each component (applications as well as operating systems) would have to log a lot more data. Operating systems provide rich data about disk usage, network usage, and all kinds of stuff, whereas applications tend to be sparing in what they log. Both need to store more data because a lot of important administrative problems require longitudinal data–a record of what’s happening over many hours.

Baron Schwartz, a MySQL performance expert who recently founded VividCortex, tells me that this yearning points toward a solution called Configuration Management Database (CMDB), which has not yet been implemented effectively. I am looking toward something like this, but distributed among many servers and applications. I don’t think it makes sense to store all this data in some central node. If applications and operating systems can send messages with relevant data (buffer sizes, allocated memory, etc.) to a monitor on the local node, it could store the data in some compressed format. The system administration field therefore needs to define some standard protocol and format to give a uniform look to all the data.
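
As a thought experiment, here is a minimal sketch of what one such message might look like, assuming a JSON payload sent to a node-local monitor listening on a Unix-domain socket; the socket path, field names, and metric names are all hypothetical:

import json
import socket
import time

# Hypothetical path where the node-local monitor listens.
MONITOR_SOCKET = "/var/run/local-monitor.sock"

def report(component, metrics):
    """Send one batch of metrics from an application to the local monitor."""
    message = {
        "component": component,    # e.g. "mysql" or "nginx"
        "timestamp": time.time(),  # when the sample was taken
        "metrics": metrics,        # metric name -> numeric value
    }
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(MONITOR_SOCKET)
        sock.sendall(json.dumps(message).encode() + b"\n")

# An application might call this periodically, for example:
# report("mysql", {"innodb_buffer_pool_bytes_free": 4096 * 512,
#                  "allocated_memory_bytes": 2 * 1024**3})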

Alternatively, a library could be provided for local applications to store their own data. For instance, a database could keep a log of buffer usage and the length of time queries took.
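
Such a library could be quite thin. Here is a sketch, assuming SQLite as the node-local store and invented table and function names, of how a database server might log its buffer usage and query durations:

import sqlite3
import time

# Node-local store; the real path and schema would follow whatever
# standard format the field settles on.
conn = sqlite3.connect("mysql-metrics.db")
conn.execute("CREATE TABLE IF NOT EXISTS samples (ts REAL, metric TEXT, value REAL)")

def record(metric, value):
    """Append one raw sample; summarization and expiry happen elsewhere."""
    conn.execute("INSERT INTO samples VALUES (?, ?, ?)", (time.time(), metric, value))
    conn.commit()

# For example:
# record("buffer_pool_bytes_used", 1.2e9)
# record("query_duration_seconds", 0.042)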

The raw data will be valuable for some investigations, but for practical reasons we have to let go of it eventually. Each app or other component that tracks data should create summaries (such as peak use, rate of change in traffic, etc.). It is these summaries that are sent to a central administrative tool. The system administrator can, if he gets desperate, go back to the raw data on a particular node if the summary data isn’t enough to debug a problem.

For example, a web server may log each request, but generate summary data in the form of how many requests were made to a particular site or domain, and how many requests were handled each second.
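
As a rough sketch of that summarization step, assume the raw data has already been reduced to (timestamp, domain) pairs, one per request; the function and field names are illustrative:

from collections import Counter

def summarize(requests):
    """Collapse raw (timestamp, domain) request records into summary data:
    hits per domain and the peak number of requests handled in any one second."""
    per_domain = Counter(domain for _, domain in requests)
    per_second = Counter(int(ts) for ts, _ in requests)
    peak_rps = max(per_second.values(), default=0)
    return {"hits_per_domain": dict(per_domain),
            "peak_requests_per_second": peak_rps}

# summarize([(1700000000.1, "example.com"),
#            (1700000000.7, "example.com"),
#            (1700000001.2, "example.org")])
# -> {'hits_per_domain': {'example.com': 2, 'example.org': 1},
#     'peak_requests_per_second': 2}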

So an administrator should be able to configure the following (a sketch of such a policy appears after the list):

  • What kinds of data should be recorded from each application, and at what resolution
  • How long the raw data should be preserved
  • What summary data should be generated from the raw data
  • How long the summary data should be preserved
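
A minimal sketch of what such a per-application policy might look like, expressed here as a Python structure; the keys, metric names, and retention periods are all hypothetical:

# Hypothetical collection policy covering the four choices above: what to
# record, at what resolution, and how long raw and summary data are kept.
COLLECTION_POLICY = {
    "nginx": {
        "raw_metrics": ["request"],
        "resolution_seconds": 1,
        "raw_retention_hours": 24,
        "summaries": {
            "requests_per_second": {"from": "request", "keep_days": 90},
            "hits_per_domain": {"from": "request", "keep_days": 90},
        },
    },
    "mysql": {
        "raw_metrics": ["buffer_pool_bytes_free", "query_duration_seconds"],
        "resolution_seconds": 10,
        "raw_retention_hours": 72,
        "summaries": {
            "peak_query_duration": {"from": "query_duration_seconds", "keep_days": 365},
        },
    },
}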

Much of the data collected to maintain performance and availability is also useful for security. A security expert may choose to rely on much of the data collected for monitoring, and derive different summary data from it.

For instance, the security expert may maintain a list of suspicious IP addresses and ask whether they made requests of the web server, or look for suspicious patterns of connection and disconnection. (I need to do a little marketing here, because O’Reilly will soon release a book titled Network Security Through Data Analysis that covers such techniques.) In general, security is such a specialized function that I won’t try to integrate it with the subject of this article.

Linkages

Now we have to consider how to conveniently move between different views of a network or host. When an application slows down drastically, an administrator should be able to call up charts of CPU and memory usage on the system. This suggests that the monitoring system needs to know that the database management system’s buffer usage is related to the operating system’s memory usage–something that’s obvious to a trained person, but that needs to be made explicit to a computer.

To turn data into knowledge, one wants to submit the relevant data from many different subsystems to an agile visualization tool–perhaps an animation that can change to reflect the evolution of the statistics you’re homing in on. You’ll need to define rules such as “when this application uses up a certain amount of buffer space, generate a chart of operating system memory usage.”

A SaaS solution, Librato, may already provide a streamlined reporting mechanism along the lines I suggest here, along with a RESTful API. Librato can aggregate statistics from many independent sources. up.time, from uptime software, is another comprehensive and unified monitoring tool, promising this type of integration for all servers, applications, IT services, and the network.

Fred van den Bosch of Librato claims (and I’ve heard others back him up) that the future of analysis and storage of monitoring data lies in SaaS tools. I’ll let him expound this in his own blog posting, but his essential reasoning is this: data collection is specific to the (application and infrastructure) resources being monitored and can effectively be done with open source tools and direct instrumentation of applications. But reliable aggregation, analysis, and storage of large volumes of monitoring data is so complex that administrators will increasingly prefer to depend on SaaS solutions built by specialists.

This is all fine with me so long as tools and APIs are developed along open source lines. Otherwise, the field of monitoring will splinter, and various firms will compete using less capable offerings than the ones collaboration could produce.

One obvious situation where you’d want to quickly see linkages is the common situation of cascading failures, where you discover that your web server crashed…because the underlying host crashed…because the cloud platform you were depending on crashed. System administrator Luke Tymowski pointed out this scenario to me.

And these rules have to be conveyed in a structured manner to the monitoring tool. For instance, let’s say you want to show how much free memory the operating system had at various measured points over the past 5 minutes whenever the InnoDB buffer pool has less than 1 megabyte of unused space. I imagine a configuration syntax like:

when
  mysql::innodb::pool+1mb > mysql::innodb_buffer_pool_size
get os:free for n = now-5min to now ;

The syntax might require a host name as well, but perhaps the parser can just apply the request to any host running MySQL, which it could learn from another source of configuration information. The parser would then assume that the “os” to get information from is the operating system running the affected MySQL server.

Now imagine how this request is carried out.

Initialization
Central monitoring software tells the MySQL database that the buffer pool size and the amount of free space it has from moment to moment are important variables. MySQL already keeps track of free space (reporting it to the administrator through various system variables, such as Innodb_buffer_pool_pages_free), so this is not a difficult requirement to fulfill.

Note that MySQL would not, in this scenario, report the free space or size to the central monitoring software; MySQL just uses those data points internally to calculate its trigger. However, MySQL must speak the same protocol as the central monitoring software and must support the request. If it doesn’t, the central monitoring software reports an error to the user during initialization.

The central monitoring software sends a similar request to the operating system to make sure it can report free memory. This time, the information actually will be returned to the central monitoring software.
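
As a sketch of what those two registration requests might look like, assuming the kind of JSON-style protocol imagined earlier (all message fields are hypothetical):

# Hypothetical registration messages sent by the central monitor during
# initialization. Neither agent returns metric values at this stage; each
# simply confirms that it understands and supports the request.

# To the MySQL agent: evaluate the condition internally, don't report values.
watch_request = {
    "type": "watch",
    "target": "mysql",
    "condition": "innodb_buffer_pool_bytes_free < 1048576",  # less than 1 MB free
    "report_values": False,
}

# To the operating system agent: keep free-memory samples for 5 minutes
# so they can be fetched when the trigger fires.
collect_request = {
    "type": "collect",
    "target": "os",
    "metric": "free_memory_bytes",
    "window_seconds": 300,
}

# A conforming agent answers with an acknowledgment; anything else is
# reported to the administrator as an initialization error.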

Of course, a server could fail catastrophically and stop collecting or reporting information without warning. This must be caught at another layer of heartbeat monitoring.

Trigger
For our system to work, MySQL must maintain its own monitoring software, which of course is intimately tied to its internal code. MySQL already monitors memory levels in order to take such actions as flushing the pool. To fulfill our memory check, MySQL must send out a signal when the InnoDB buffer is squeezed for memory.

Perhaps the signal will go through a standard protocol directly to the operating system, with just a notification to the central monitoring software that it took place. Or perhaps MySQL will notify the central monitoring software, which in turn will make a request to the operating system. In either case, the operating system must be listening for a signal (which may come over an Internet socket from central monitoring software on a remote host, so dedicated sockets must be defined for this communication).
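
A sketch of the trigger side, assuming MySQL's internal monitor can see its own count of free buffer pool pages and that the notification travels over the same hypothetical protocol:

PAGE_SIZE = 16384        # default InnoDB page size, in bytes
THRESHOLD = 1 * 1024**2  # the 1 MB limit from the rule

def check_buffer_pool(free_pages, notify):
    """Called from MySQL's own monitoring loop; fires the trigger when
    less than 1 MB of the buffer pool is unused."""
    free_bytes = free_pages * PAGE_SIZE
    if free_bytes < THRESHOLD:
        # Either straight to the OS agent or via the central monitor;
        # here we just hand a message to whatever transport was set up.
        notify({"type": "trigger",
                "source": "mysql",
                "rule": "innodb_buffer_pool_low",
                "free_bytes": free_bytes})
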
Response
The operating system has already been asked to keep a table of free memory over regular intervals. It knows that it can discard any information more than 5 minutes old. When the information is requested by the central monitoring software, the operating system obliges with an informative response. Although this simple example requires sending the raw data, other scenarios may have the OS compress the data in some way, such as providing an average.

The central monitoring software can use a popular visualization tool to show the data to the administrator, or even take independent action. In any case, the administrator can quickly learn that the MySQL server is running out of buffer space and simultaneously view the memory usage on its operating system.
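
And a sketch of the response side: an agent that keeps a rolling window of free-memory samples and hands back the last five minutes on request. The sampling source here is /proc/meminfo, so this particular sketch assumes Linux:

import time
from collections import deque

WINDOW_SECONDS = 300  # keep 5 minutes of samples
samples = deque()     # (timestamp, free_bytes) pairs

def read_free_bytes():
    """Read MemFree from /proc/meminfo (Linux-specific)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1]) * 1024  # reported in kB
    return None

def sample():
    """Record one sample and discard anything older than the window."""
    now = time.time()
    samples.append((now, read_free_bytes()))
    while samples and samples[0][0] < now - WINDOW_SECONDS:
        samples.popleft()

def handle_request():
    """Answer the central monitor with the raw samples; other scenarios
    might return an average or some other compressed form instead."""
    return {"metric": "free_memory_bytes", "samples": list(samples)}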

We’re talking about a lot of standards, but everything I’ve suggested involves data that is readily available. An architecture like this could take a lot of fumbling and delay out of the system administrator’s life. And while my example showed a very simple trigger, lots of longitudinal data could be tracked and used to warn administrators of developing problems.

The administrator has to be selective in setting up triggers and requesting information. It may be infuriating to find that you carefully selected the wrong information to get back. But every incident is a learning experience, and can be incorporated into the next round of monitoring.

If such a monitoring system gets set up, the community will quickly provide recipes and plugins for common activities, as it has long done for Nagios. Adoption could snowball, thanks to network effects. As more servers and systems are updated to recognize the communications protocol and record the necessary data, the software gains value and attracts support from even more servers.

The urgency of optimizing procedures for monitoring

The growth of virtualization and high-availability architectures is already straining the ability of administrators to notice when something is about to bust.

Tymowski points out to me that the simple one-server scenario was replaced long ago by tiers of load balancers, caching servers such as Memcached, and multiple data sources intricately combined for delivery over the Web.

As Schwartz says, administrators who used to manage a couple of physical machines are now responsible for hundreds of virtual servers. Data and traffic have grown exponentially, and code is developed and deployed at a dizzying pace. He points out that overeager attempts to measure everything the administrator can think of break down as these sizes all scale up.

Bottling human intelligence, experience and (as often as not) intuition is a daunting goal, and the search for radical advances in monitoring is certainly such a goal. But the building blocks seem tantalizingly at hand.

Thanks to Dustin Lawler, Slawek Ligus (author of Effective Monitoring and Alerting), Baron Schwartz, Luke Tymowski, and Fred van den Bosch for their comments and critiques.
