Revisiting “What is DevOps”

If all companies are software companies, then all companies must learn to manage their online operations.

Two years ago, I wrote What is DevOps. Although that article was good for its time, our understanding of organizational behavior, and its relationship to the operation of complex systems, has grown.

A few themes have become apparent in the two years since that last article. They were latent in that article, I think, but now we’re in a position to call them out explicitly. It’s always easy to think of DevOps (or of any software industry paradigm) in terms of the tools you use; in particular, it’s very easy to think that if you use Chef or Puppet for automated configuration, Jenkins for continuous integration, and some cloud provider for on-demand server power, you’re doing DevOps. But DevOps isn’t about tools; it’s about culture, and it extends far beyond the cubicles of developers and operators. As Jeff Sussna says in Empathy: The Essence of DevOps:

…it’s not about making developers and sysadmins report to the same VP. It’s not about automating all your configuration procedures. It’s not about tipping up a Jenkins server, or running your applications in the cloud, or releasing your code on Github. It’s not even about letting your developers deploy their code to a PaaS. The true essence of DevOps is empathy.

By “empathy,” Jeff means an intimate understanding between the development and operations teams. Tricks like co-locating dev and ops desks, placing them under the same management, and so on, are just a means to an end: the real goal is communication between groups that can easily become antagonistic. Indeed, the origin of our Velocity conference was the realization that, although the development and operations teams were frequently antagonists, they spoke the same language and had the same goals.

In his blog post The Promises of DevOps, Mark Burgess discusses the connection between DevOps and promise theory. Promise theory is a radically different take on management: rather than basing management on a top-down, command-and-control network of requirements, promise theory builds services from networks of local promises. Components of a system (which may be machines or humans) aren’t presented with a list of “requirements” that they must deliver; they are asked to make “promises” about what they are able to deliver. Promises are local commitments: a developer commits to writing a specific piece of code by a specific date; operations staff commit to keeping servers running within certain parameters.

Promise theory doesn’t naively assume that all promises will be kept. Humans break their promises all the time; machines (which can also be agents in a network of promises) just break. But with promise theory, agents are aware of the commitments they’re making, and their promises are more likely to reflect what they’re capable of performing. As Burgess explains:

Dev promises things that Ops like; Ops promises things that Dev likes. Both of them promise to keep the supply chain working at a certain rate, i.e. Dev supplies at a rate that Ops can promise to deploy. By choosing to express this as promises, we know the estimates were made with accurate information by the agent responsible, not by external wishful thinkers without a clue.

And a well-formed network of promises includes contingencies and backups. What happens if Actor A doesn’t deliver on promise X? It may be counterintuitive, but a web of promises exposes its weak links much more readily than a top-down chain of command. Networks of promises provide services that are more robust and reliable than command-and-control management pushed down from above. As Tim Ottinger puts it in a pair of Tweets:

That’s the difference between top-down management and promise theory in a nutshell: are you building a machine made of human cogs, or a community of talent?

Burgess is completely clear that DevOps isn’t about tools and technologies. “Cooperation has nothing to do with computers or programming. The principles of cooperation are universal matters of information exchange, and intent.” Cooperation, information exchange, and networks of intent are first and foremost cultural issues. Likewise, Sussna’s concept of “empathy” is about understanding (again, information exchange), and understanding is a cultural issue.

It’s one thing to talk about cultural change and understanding; it’s something different to put it into practice. To make this concrete, let’s talk about one particular practice: blameless postmortems at Etsy. As John Allspaw writes, if a postmortem analysis is about understanding what actually happened, it’s essential to do so in an atmosphere where employees can give an account of events “without fear of punishment or retribution.” A postmortem is about information exchange and empathy (to use Sussna’s word). If we can’t find out what happened, we have no hope of building systems that are more resilient.

Blameless postmortems are all the more important because of another aspect of modern computing. Top-down management has long insisted that, when there’s a failure, it must be traced to a single root cause, which usually ends up being “human error.” But for complex systems, there is no root cause. This is an extremely important point: as we’ve pointed out, all systems are distributed, and all systems are complex systems. And almost all failures are the result of “perfect storms” of unrelated events, not single failures or errors that could or should have been anticipated. As Allspaw puts it, paraphrasing Richard Cook, “failures in complex systems require multiple contributing causes, each necessary but only jointly sufficient.”

If you’ve ever worked in a company where the project wasn’t over until the blame was assigned (as I have), you know that the short-term result of “single root cause” thinking is blame and shame for the individual “responsible.” The long-term result is a solution that inevitably makes the organization more brittle and failure-prone, and less agile, less able to adapt to changing circumstances. Without a culture of understanding and empathy, it is impossible to get to real causes and to build systems that are more resilient.

The conclusions we’re coming to are far-reaching. We’ve been discussing cultural change and DevOps, but we have hardly mentioned computing systems, software developers, or infrastructure engineers. It doesn’t matter a bit whether the postmortem is about a server outage or bad lending practices; the same principles apply.

If all companies are software companies, then all companies have to learn how to manage their online operations. But beyond that: on the web, we’ve seen dramatic decreases in product development time and dramatic increases in reliability and performance. Can those increases in productivity be extended through the whole enterprise, not just the online group? We believe so. Can practices like blameless postmortems make corporations more resilient in the face of failure, in addition to improving the lives of employees at every level? We believe so. Adoption of DevOps principles across the enterprise, and not just in the “online group,” will be a slow process, but it’s a necessary process. In five or ten years, we’ll look back at who survived and who thrived, and we’ll see that the enterprises that have built communities of collaboration, mutual respect, and understanding have outperformed their competition.

DevOps isn’t just about Dev and Ops. It’s about corporate management as a whole; it’s about the entire corporate culture, from the janitors (who promise to keep the building clean) to the CEO (who promises to keep the company funded and the paychecks coming). Promise theory has emerged as the intellectual framework underpinning that change in culture. And Velocity is where we are discussing those changes.

See you in New York! Or Beijing or Barcelona!
