Julie Steele and I recently had lunch with Etsy’s John Allspaw and Kellan Elliott-McCrea. I’m not sure how we got there, but we made a connection between web operations and medical care for premature infants that was (to me) astonishing.
I’ve written several times about IBM’s work in neonatal intensive care at the University of Toronto. In any neonatal intensive care unit (NICU), every baby is connected to dozens of monitors. And each monitor is streaming hundreds of readings per second into various data systems. They can generate alerts if anything goes severely out of spec, but in normal operation, they just generate a summary report for the doctor every half hour or so.
IBM discovered that by applying machine learning to the full data stream, they were able to diagnose some dangerous infections a full day before any symptoms were noticeable to a human. That’s amazing in itself, but what’s more important is what they were looking for. I expected them to be looking for telltale spikes or irregularities in the readings: perhaps not serious enough to generate an alarm on their own, but still, the sort of things you’d intuitively expect of a person about to become ill. But according to Anjul Bhambhri, IBM’s Vice President of Big Data, the telltale signal wasn’t spikes or irregularities, but the opposite. There’s a certain normal variation in heart rate, etc., throughout the day, and babies who were about to become sick didn’t exhibit the variation. Their heart rate was too normal; it didn’t change throughout the day as much as it should.
That observation strikes me as revolutionary. It’s easy to detect problems when something goes out of spec: If you have a fever, you know you’re sick. But how do you detect problems that don’t set off an alarm? How many diseases have early symptoms that are too subtle for a human to notice, and only accessible to a machine learning system that can sift through gigabytes of data?
In our conversation, we started wondering how this applied to web operations. We have gigabytes of data streaming off of our servers, but the state of system and network monitoring hasn’t changed in years. We look for parameters that are out of spec, thresholds that are crossed. And that’s good for a lot of problems: You need to know if the number of packets coming into an interface suddenly goes to zero. But what if the symptom we should look for is radically different? What if crossing a threshold isn’t what indicates trouble, but the disappearance (or diminution) of some regular pattern? Is it possible that our computing infrastructure also exhibits symptoms that are too subtle for a human to notice but would easily be detectable via machine learning?
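To make the idea of a "disappearing pattern" concrete, here is a minimal sketch of one way to flag it: compare the spread of a metric in each window against the spread seen during a known-healthy period, and flag windows that are too quiet. Everything here is an assumption for illustration (the function name `low_variation_windows`, the `ratio` cutoff, the sample readings); it is not IBM's method, and a real system would have to learn what "normal variation" means rather than be told.

```python
from statistics import pstdev


def low_variation_windows(samples, window, baseline_std, ratio=0.5):
    """Return start indices of non-overlapping windows whose standard
    deviation falls below ratio * baseline_std (i.e. "too normal")."""
    flagged = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        if pstdev(chunk) < ratio * baseline_std:
            flagged.append(start)
    return flagged


# Hypothetical requests-per-second readings: a healthy oscillation,
# followed by a stretch that never leaves the healthy range but is flat.
healthy = [100, 120, 100, 80] * 3
flat = [100, 101, 100, 99] * 3

# Baseline spread measured over the period assumed to be healthy.
baseline = pstdev(healthy)

print(low_variation_windows(healthy + flat, window=4, baseline_std=baseline))
```

Note that none of the flat readings would trip a conventional threshold: every value sits comfortably inside the range seen during the healthy period. Only the missing variation gives it away.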
We talked a bit about whether it was possible to alarm on the first (and second) derivatives of some key parameters, and of course it is. Doing so would require more sophistication than our current monitoring systems have, but it’s not too hard to imagine. But it also misses the point. Once you know what to look for, it’s relatively easy to figure out how to detect it. IBM’s insight wasn’t detecting the patterns that indicated a baby was about to become sick, but using machine learning to figure out what the patterns were. Can we do the same? It’s not inconceivable, though it wouldn’t be easy.
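For what it's worth, the derivative alarm we discussed is the easy part. A rough sketch, using simple finite differences over evenly spaced samples, might look like this; the function names, the limits `max_rate` and `max_accel`, and the latency numbers are all made up for illustration, and real data would need smoothing before differencing.

```python
def derivatives(samples, dt=1.0):
    """Approximate first and second derivatives by finite differences,
    assuming samples are evenly spaced dt apart."""
    first = [(b - a) / dt for a, b in zip(samples, samples[1:])]
    second = [(b - a) / dt for a, b in zip(first, first[1:])]
    return first, second


def derivative_alarms(samples, max_rate, max_accel, dt=1.0):
    """Return (sample_index, kind, value) tuples wherever the rate of
    change or its acceleration exceeds the given limits in magnitude."""
    first, second = derivatives(samples, dt)
    alarms = []
    for i, rate in enumerate(first):
        if abs(rate) > max_rate:
            # first[i] describes the step ending at sample i + 1
            alarms.append((i + 1, "rate", rate))
    for i, accel in enumerate(second):
        if abs(accel) > max_accel:
            # second[i] describes the change ending at sample i + 2
            alarms.append((i + 2, "accel", accel))
    return sorted(alarms)


# Hypothetical response times in milliseconds, climbing faster and faster.
latency_ms = [50, 51, 52, 54, 58, 66, 82]

for alarm in derivative_alarms(latency_ms, max_rate=10, max_accel=3):
    print(alarm)
```

With these invented numbers, the acceleration alarm fires one sample before the rate alarm does. But the limits still had to be chosen by hand, which is exactly the limitation described above: the code detects a pattern someone already knew to look for.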
Web operations has been at the forefront of “big data” since the beginning. Long before we were talking about sentiment analysis or recommendation engines, webmasters and system administrators were analyzing problems by looking through gigabytes of server and system logs, using tools that were primitive or non-existent. MRTG and HP’s OpenView were savage attempts to put together information dashboards for IT groups. But at most enterprises, operations hasn’t taken the next step. Operations staff don’t have the resources (neither computational nor human) to apply machine intelligence to our problems. We’d have to capture all the data coming off our servers for extended periods: not just the server logs that we capture now, but any and every kind of data we can collect, including network data, environmental data, I/O subsystem data, you name it. At a recent meetup about finance, Abhi Mehta encouraged people to capture and save “everything.” He was talking about financial data, but the same applies here. We’d need to build Hadoop clusters to monitor our server farms; we’d need Hadoop clusters to monitor our Hadoop clusters. It’s a big investment of time and resources. If we could make that investment, what would we find out? I bet that we’d be surprised.