Sat, Jun 14, 2008

Jesse Robbins

Understanding Web Operations Culture (Part 1)

“You don’t choose the moment, the moment chooses you. You only choose how prepared you are when it does.” - Fire Chief Mike Burtch

(Note: I became a Firefighter-1 and EMT in 2000. My experiences in the fire service profoundly influence my efforts in technology. Much of my work over the past few years has been translating and distilling my knowledge from these two worlds, teaching others, and finding ways to apply it in the service of both.)

Last week I came upon a truck vs. scooter accident on my way home. I could hear a woman yelling in pain from underneath the truck (a good sign!) and could see a guy in the cab looking panicked and touching his controls. I stopped my car and “surveyed the scene” looking for things that might kill me (traffic, hazmat, downed power lines) or make the situation worse if undetected (additional victims, deflating tires, fires).

It looked like the driver was about to move his truck, which would have definitely made things worse. I used my ‘command voice’ to yell “Put it in park! Stop your engine! Set your brake! Get out and wait!” as I approached the truck.

A city crew came over, and one of them told me “We’ve called 911 and they are on their way.”

I asked them to handle traffic control as I approached my patient. I then introduced myself and asked her if I could help. (I have to obtain consent before assisting an injured person, and a response means I know they still have their Airway, Breathing, and Circulation intact.)

Her legs were entangled in her scooter, which was trapped underneath the truck. While she probably had broken her leg, it didn’t look all that bad. She was still wearing her helmet, and it wasn't seriously damaged, which meant her head was probably okay too. I did a quick check for bleeding and other serious injuries and did a “mental status check” by asking her name, where she was (“on my way to school”), and what had happened (“I was riding and that a**hole RAN OVER ME!”). This meant she was alert and oriented, which was good.

Now that I was sure there weren’t any other life-threatening injuries, I prepared to hold her head for c-spine stabilization. (Once you start holding stabilization, you cannot move again until you are ready to put the patient on a backboard.)

As I positioned myself on the ground and took hold of her head, I explained “I’m going to hold your head now to protect your neck and back. Once the fire department gets here, they are going to get your legs unstuck and then we’ll get you on a backboard. Your job is to keep still and keep talking to us. There will be a lot of commotion and noise around you, and that’s okay. Everyone will be watching out for you and so there is no reason to be scared. We’ve got you.”

As the fire department arrived they too surveyed the scene and I gave my quick report to the medics. They freed her legs and we transferred her to a backboard. I was released from the scene just as they started removing her helmet, and never even saw her face.

Why am I telling you this story?

I’m telling this story to illustrate how Operations culture works and to provide a little insight into how it is created.

The city workers showed up, called 911, and made it safe for me to treat the patient by controlling traffic. I stopped the truck driver from further injuring the patient and stabilized her until the fire department and medics arrived. The medics took her to the hospital ER where she was probably treated and released.

This is exactly how things should have gone in this situation. It happened because of people with a common desire or duty to act, training on how to act, and experience actually doing it. This is the essence of effective Operations culture.

What does this have to do with Web Operations?

Organizations that depend on the web will die if their site crashes and they don’t recover. The longer the outage, the worse the damage often is. The same kind of Operations culture is required to effectively respond to, recover from, and prevent outages.

While this seems obvious to many people with years of experience working on the web, it is a significant and often difficult shift for those in the mainstream. This seems particularly true for executives who think of Web Operations as an extension of corporate IT. This gap becomes especially painful when people accustomed to traditional “command-and-control” management styles and models try to apply them to this new type of organization.

The CEO cannot shout or fire the website back up. The CFO cannot account, control, or audit the website back up and the Chief Counsel cannot sue it back to life. The CMO, if there is one, and their entire marketing & PR team will not spin a website back online. The CIO or CTO probably can’t recover the site either, at least not very quickly. The fate of the company frequently and acutely rests in the hands of engineers who do Web Operations.


If you're interested in Web Operations you should attend Velocity on June 23-24th.


 
Comments: 11

  Andrew Clay Shafer [06.14.08 07:01 PM]

Jesse,

I love the relevant analogy and the conclusion about the dependency of organizations.

Hopefully I'm not stealing your thunder for part II, but I would extend the analogy and argue that an organization can become dependent on this type of 'hero' work.

The organization where web infrastructure becomes a codependent cycle of near disasters with heroes pulling 'all nighters' is more common than most realize or care to admit.

I've indirectly commented about why I think that can happen here:
http://refresh.gigaom.com/2008/06/12/the-craft-automation-and-scaling-infrastructure/

Obviously, accidents happen. The need for expert skill and quick thinking will never go away, but all accidents are not created equal. Many organizations perpetuate their problems precisely because the operations team is good at fire fighting.

It gets ingrained in the culture.

Andrew Shafer
Reductive Labs

  Andrew Clay Shafer [06.14.08 07:35 PM]

Oh yeah, see you at Velocity!

  Fred Moyer [06.14.08 08:11 PM]

Nice read. Genuine fire drills (not all nighters to make deadlines) happen even with the best of contingency planning in the best organizations.

Success brings fire drills more often than failure. Extensive planning up front to avoid operations problems often leads to systems which are complex enough to house many more states of potential failure than simple systems.

However, repeat problems of the same nature are often indicative of failure of operations to address the cases they have seen before.

  John Allspaw [06.14.08 08:46 PM]

Jesse - Excellent post. When looking for *good* web ops people, it's exactly this sort of deliberate, quick, and precise thinking that separates the good from the bad. I think your analogy is spot-on.

I will agree with Andrew that automated infrastructure helps in 'crisis' situations, for the consistency and efficiency needed in executing decisions a team needs to make in the heat of the crisis moment.

But I will also say that while the efficient execution of each step is obviously paramount, I would underscore communication and coordination to be just as important as execution.

Anyone who has been part of a team working hard on getting a website back in functioning order will recognize that without proper coordination and communication amongst the members of the team, production troubleshooting and incident mitigation can turn into an unnecessarily long nightmare.

I'm looking forward to seeing Brent Chapman's talk on these sorts of issues.

  Ben Kepes [06.14.08 09:52 PM]

Jesse

Great post - as an aside it would seem that you have some EMS training - or some inside knowledge. I don't know about you but it's rare to find people with an insight into the Web 2.0 space who also understand the rescue services world.

Anyway - great post and I'll look forward to the next part!

b

  orcmid [06.14.08 09:56 PM]

Fascinating account of first-on-scene, first-responder behavior.

I disagree about the hero business, as I am sure you will correct as you go. This is an account of ordinary people who are trained to respond methodically in extraordinary circumstances. As you know there are companies throughout California that have volunteer Emergency Response Teams that train and practice for this kind of situation (although I assume you have EMT training for what you did).

I can imagine operations-control arrangements for Internet/Intranet security penetrations and for loss of access to critical business systems, the sort of thing that involves bringing the business up in a new location on different systems. I think you can tell whether they are really in place by whether or not full-up drills are ever conducted. I can imagine other emergencies that might fit, even dealing with a bad rumor.

I'm curious for you to account for the practice that brings you and others to the place where you can respond in this way. How do you see the necessary decentralization and delegation of ability to respond being applicable in regard to digital-system operations?

  Jesse Robbins [06.14.08 11:14 PM]

Yes, I am a firefighter-1 and recently expired EMT. Thank you all for the comments. I'll respond in detail tomorrow.

  Mahesh [06.15.08 12:23 AM]

I loved the first part of your story. I didn't know about you till I read this post in my reader. Based on how you have put it, I figured you were more of a doctor. The first part of your story is exceptionally well written and very riveting. I would much appreciate if you could find out and tell us all (now that you have got us so involved in the story), that the girl is ok now.

I hated the second part of the story. I hate FUD. And I hated how in a moment you used your fantastic heroic act to sell some cheap little conference.

Sincerely, the first part of the message made you a superhero in my mind, and I am sure in most readers' minds (not many people in today's day and age will go out to help others, no matter how well trained they are, if there's no money involved). The second part made you a pimp! Please don't cheapen yourself or your heroic acts. You are a good man, please stay that way.

  Andrew Clay Shafer [06.15.08 11:05 AM]

@Mahesh

I'm sure Jesse didn't mean to offend your sensibilities. Though to be honest, after reading the title and the first quote, I'm not certain how you found the ending to be a surprise.

Further, the life of a company can be quite literally in the hands of the operations team. Failure to execute can close the doors, resulting in a loss of income and real strife to all involved.

I don't mean to belittle your feelings. I feel the analogy was sensational, but it was also relevant.

@Fred Moyer

It's not a fire drill if there is a real fire.

Let's reframe the analogy just a little bit. (hopefully if we take out the acute trauma others will find it more palatable)

I think you can put most of the problems an operations team will face into three buckets: lightning strikes (hardware failure, etc.), accidental fires (human caused) and floods (the problems of success, digg effect, scale, etc.).

Resolving any of these problems effectively depends on the operations team having enough skill, insight and communication to assess, respond and escalate until there is a resolution.

Imagine the outcome of the original story, if the fire department never comes. Scary. . .

Obviously, there is nothing we can do to stop lightning, but we can build fire breaks and have a plan to respond. We all secretly hope to have the opportunity to respond to floods, but again preparation is another thing.

The main point I was trying to make is that many organizations have an inordinate amount of accidental fires and are ill prepared for lightning or floods precisely because that is the 'tradition'.

I don't want to put words in his mouth but I believe Jesse is making a similar point here:
http://radar.oreilly.com/archives/2007/10/operations-is-a-competitive-ad.html

To your points about planning and all nighters. I believe the approach with the biggest advantage has more to do with day to day methodology and willingness to invest in infrastructure than a big complicated plan up front. Finally, all nighters to meet deadlines are almost always the result of poor planning and communication, in addition to being a great strategy to get some accidental fires to clean up.

@John Allspaw

Absolutely agree, nothing turns a manageable emergency into an all night clean up effort faster than more than one person or group trying to respond without communicating.

@orcmid

I don't want to minimize Jesse's points about trained first responders and the effective chain of skill and responsibility in an emergency, because I think they are great points.

But, if you don't realize there is a pervasive 'hero' mentality in operations culture, then either 1) you haven't been in or around operations or 2) you are one of the 'heroes'.

@Jesse

Looking forward to your response. . .

  Jesse Robbins [06.15.08 12:27 PM]

@Andrew

It seems that some organizations go from ignoring danger, to crisis, to being saved by a few "heroes", and then they get stuck for a while in the cycle you describe. Breaking out of the cycle requires a big shift in thinking and culture.

@Fred Moyer

I think you mean "Fires", not "Fire Drills"! Drills & Exercises are the only way to get experience before you have an actual fire. I'll be writing about this in one of the next posts.

@John Allspaw

Thanks! Automated Infrastructure is part of the solution for WebOps, just as Fire Sprinklers (another kind of automated infrastructure) are part of the solution for Buildings.

@Mahesh,

What I did last week was simply a function of my training and desire to serve others. I'm not a hero of any kind, nor was my act heroic. I am part of a culture of people who help others. This culture includes the millions of people who have taken a CPR/AED class, community disaster teams, volunteer & professional first responders, nurses, doctors, public leaders, and everyone in between.

The analogy I used and the shift it represents is very real. I could have spent more time bridging the two ideas. I'm sorry that it appeared to be FUD to you. It is not.

My experiences as a Firefighter/EMT and Emergency Manager profoundly influence my work as an engineer and manager. Much of my work over the past two years has been translating my knowledge from these two worlds, teaching others, and finding ways to apply them in the service of both.

  Fred Moyer [06.17.08 09:22 AM]

@Jesse Robbins

The intention was more of "Fires" there, not "Fire Drills". It has been the overloaded terminology used where I have worked, perhaps to differentiate actual combustion events (which would necessitate an exodus of people from the building) from web operations emergencies.

Maybe the term 'Web Operations Incident' is more appropriate? The situations I have always had to respond to for web operations incidents have always been termed 'fire drills'.
