Starting Small with Great Expectations

Explicit expectations are key to operating at scale

Our lives are rife with expectations.

When we flip the light switch, we expect electrons to flow; when we issue CPU instructions, we expect to get the correct answer; when we look at commit logs in the source repository, we (hopefully) expect tests to accompany them, and expect that our colleagues ran those tests before check-in. But we’ve all probably been burned by these types of assumptions at some point.

In an operational environment, like the large-scale websites and build farms we’re responsible for, these sorts of expectations can be a costly cause of errors, and are one of the prime sources of miscommunication. Many a postmortem has uncovered that some expectation the ops team had of the development team was actually an assumption… and we all know that old saying about assumptions and donkeys.

Stated, straightforward expectations are one aspect of the National Airspace System (NAS) that has allowed it to scale to the traffic levels it has, while still increasing its safety margin. In the previous two parts of this series, we looked at the fundamentals of standardizing your operational processes and some techniques to improve communication around those processes. Certainly there is benefit to improved communication and the more predictable outcomes that process facilitates, but being able to leverage these foundations to set expectations is really where an organization can realize the value of a so-called “instrument rating.”

What to Expect When You’re Aviating

A core foundation of the NAS is its operators’ ability to rely on expectations of each other: pilots expect controllers to provide a set of services, communicated in a particular format; controllers expect pilots to comply with their clearances. Both expect the other to announce (and justify) deviations to the extent possible in the moment.

Where the NAS’s concept of expectations gets interesting is when the system is influenced by factors that affect its “steady state”, like inclement weather, equipment failures, or human factors (sick passengers, etc.).

My favorite example of this is all the expectations that go into effect when active communications are lost between a controller and a plane. Remember that the system was designed to complete the entire trip without being able to see the ground or other aircraft; so these rules cover details like what altitude the plane is expected to fly at for the rest of its trip, what routing it’s expected to take, what landing procedure it’s expected to execute when it gets to its destination, and so on. They also set the expectation that controllers will continue to keep the flight path clear of other traffic along that route, even if it’s crossing the entire country.

A more blatant example: you might hear a controller say something like “Descend and maintain 5,000; expect the I-L-S runway three approach.” This may seem like it’s a hint for the pilot of things to come, but in the event of lost communications, it serves as a proxy for the actual clearance.

Transforming Assumptions Into Expectations

My treatment of expectations thus far might sound like what any organization, software development or otherwise, does every day. After all, we all expect things out of our coworkers.

Admittedly, it’s a bit of a nuanced topic. Like communication, the setting of explicit, actionable expectations is, fundamentally, a cultural shift. It’s a shift to a perspective that values established, standardized processes over reaching for improvisation as the first problem-solving tool. This often gets interpreted as forcing or prescribing a brittle path that developers and ops engineers must follow, effectively hog-tying them. That’s not the case at all, but it’s heard that way because it does involve valuing and agreeing upon limits to operational degrees of freedom.

It holds them responsible for communicating and explaining deviations from those processes and expectations, so others can continue to fulfill their responsibilities to the team. It makes clear whom they need to inform, and it should account for what happens when that’s not immediately possible, as in aviation’s NORDO (no radio) condition.

We certainly already do some of this day-to-day: we post to the team mailing list when we make API or build process changes; we swing by a QA engineer’s desk to let her know she’ll need to test some new functionality. But when we do this, we’re setting expectations informally, to a potentially incomplete audience. Therefore they aren’t expectations: they’re assumptions. And assumptions are fragile.

One way to address this is to formalize how expectations, and changes to them, are communicated. Harking back to effective communication, a company I worked with included a “QA Notes” section on its source code commit form, which prompted developers to annotate changes to expectations as part of the record of that source code change, forever accessible and available to anyone who looked at the log. Some QA departments parsed that field out and automatically sent updates to engineers who were interested in reading them.
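
To make that concrete, here is a minimal sketch of how a “QA Notes” field might be pulled out of a commit message. The message layout is an assumption for illustration (a line beginning with “QA Notes:” that runs until the next blank line), and the names extract_qa_notes and require_qa_notes are hypothetical; the company’s actual commit form and tooling aren’t described here.

```python
import re
from typing import Optional

# Assumed (hypothetical) layout: a line starting with "QA Notes:" whose text
# continues until the next blank line or the end of the commit message.
QA_NOTES_RE = re.compile(
    r"^QA Notes:[ \t]*(.*?)(?=\n\s*\n|\Z)",
    re.MULTILINE | re.DOTALL | re.IGNORECASE,
)


def extract_qa_notes(commit_message: str) -> Optional[str]:
    """Return the QA Notes text as a single line, or None if absent or empty."""
    match = QA_NOTES_RE.search(commit_message)
    if match is None:
        return None
    # Collapse a multi-line note into one line for easy forwarding.
    notes = " ".join(line.strip() for line in match.group(1).splitlines()).strip()
    return notes or None


def require_qa_notes(commit_message: str) -> str:
    """Fail loudly when a commit carries no QA Notes, instead of passing silently."""
    notes = extract_qa_notes(commit_message)
    if notes is None:
        raise ValueError("commit message has no QA Notes section")
    return notes


if __name__ == "__main__":
    message = (
        "Change build output directory\n"
        "\n"
        "QA Notes: Installer now lands in dist/;\n"
        "update smoke tests.\n"
        "\n"
        "Reviewed-by: a colleague\n"
    )
    print(extract_qa_notes(message))
```

A check like require_qa_notes could sit in a commit hook or a CI step, so that a missing note is rejected up front rather than discovered later; whether that strictness suits a given team is a judgment call.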

No single solution will be appropriate for every organization or type of software. As all of your teams begin to dissect the assumptions they’ve made between themselves, they can start to come up with these types of tweaks that can be streamlined into the workflow. This turns assumptions into actual expectations that people can build automation around and unit test, with the types of reporting that make breaches of expectations loud and obvious, instead of silent as is often the case with broken assumptions.

In Search of Reasonable Outcomes

A common counterargument I hear about relying on standard processes and setting formalized expectations is the belief that it negates improvisation. Nothing could be further from the truth. In fact, it reduces the risk and improves the chances for positive outcomes in situations where improvisation is the best solution.

The quintessential aviation example is the unprecedented decision to go “ATC zero” across the entire American airspace on the morning of the September 11th attacks. Air traffic controllers had to scramble to improvise methodologies to accomplish this order as swiftly and safely as possible, under incredible pressure; even worse, their situation included the known existence of bad actors, but none had been identified yet. Controllers rose to the challenge, but to do so, relied on extending the existing standardized processes of the NAS, communicating clearly, and taking full advantage of the power of expectations, both on the ground and in the air.

The goal in such an extraordinary and unexpected incident is to have a “reasonable outcome.” Being able to rely on expected behaviors and an identifiable set of responsibilities facilitates this. It allows system operators to efficiently and accurately extrapolate what is possible during these events. And because there exists an established set of behaviors, deviations can be used to assist in risk assessment, or make it clear when others need to be notified of a change in the expected protocols.

In a similar manner, when we can establish expectations between, say, an ops team and a development team during a crisis situation, those teams can skip mediating that in real time, knowing that the expected responsibilities will be clear (and should already naturally be divvied up, which helps to parallelize problem solving).

Doing this is as simple as establishing roles and responsibilities and agreeing to follow them in a crisis: many of us have had the experience of responding to an incident, only to find a colleague already logged in and tinkering around. In aviation, great care is taken to ensure it’s always clear who is in control of the airplane and the radar scope. That may change in an emergency, but that handoff process is also clear. Really, the only way to embrace this sort of behavior (because it’s natural for everyone to want to get in and help in a crisis) is to drill for it, and get engineers out of the emergency-as-problem-to-solve pattern and into an operational one.

In this way, expectations almost serve as a cultural “unit test” for your processes: if people’s expectations aren’t assumptions, and consistently match the reality of your organizational behaviors, even in critical situations, the test passes and your team is “anti-fragile.”

In the end, formalized expectations really just boil down to a nuanced, but very real cultural transformation that acknowledges the importance of the ability to declare what we’re offering and relying on from others, and that doing so is critical to being able to scale a system in a safe, sustainable way without getting in each other’s way as it grows (or when it falters).

When it comes to expectations, what should teams define when systems falter? We’ll take a look at the specifics next week.
