Empathy, communication, and collaboration across organizational boundaries.
I might try to define DevOps as the movement that doesn’t want to be defined. Or as the movement that wants to evade the inevitable cargo-culting that goes with most technical movements. Or the non-movement that’s resisting becoming a movement. I’ve written enough about “what is DevOps” that I should probably be given an honorary doctorate in DevOps Studies.
Baron Schwartz (among others) thinks it’s high time to have a definition, and that only a definition will save DevOps from an identity crisis. Without a definition, it’s subject to the whims of individual interest groups, and ultimately might become a movement that’s defined by nothing more than the desire to “not be like them.” Dave Zwieback (among others) says that the lack of a definition is more of a blessing than a curse, because it “continues to be an open conversation about making our organizations better.” Both have good points. Is it possible to frame DevOps in a way that preserves the openness of the conversation, while giving it some definition? I think so.
DevOps started as an attempt to think long and hard about the realities of running a modern web site, a problem that has only gotten more difficult over the years. How do we build and maintain critical sites that are increasingly complex, have stringent requirements for performance and uptime, and support thousands or millions of users? How do we avoid the “throw it over the wall” mentality, in which an operations team gets the fallout of the development teams’ bugs? How do we involve developers in maintenance without compromising their ability to release new software?
When you start to use the Chef configuration management system, you will quickly encounter a tool it ships with called Ohai which collects information about the underlying system to expose to the Chef Client as attributes during its run. These attributes allows you to easily incorporate system-specific behavior into your Chef cookbooks, for example allowing you to create a single “package” resource in your cookbook which automatically detects whether to install packages from Yum when running on a Redhat based Linux distribution or from Apt when running on a Debian based distribution. Read more…
Phil Dibowitz explains the challenges and the results they got with Chef
At OSCON, Phil Dibowitz reminded me how little I understand about large systems – as he puts it, really large systems, systems of systems, with some similarities but with different people controlling parts. His work at Facebook explores the challenges (and opportunities) of creating tools that work across a company’s many networks and computers.
If you deal with such challenges, he’s worth listening to as a model. If not, he’s worth listening to for a sense of just how different work at this scale can be, though much of what he accomplishes can be worthwhile at scales much smaller than the 17,000 servers he describes for Facebook at 26:42 in the session.
I talked with him in an interview:
and we’ve posted his OSCON session:
The NSA Can't Replace 90% of Its System Administrators
In the aftermath of Edward Snowden’s revelations about NSA’s domestic surveillance activities, the NSA has recently announced that they plan to get rid of 90% of their system administrators via software automation in order to “improve security.” So far, I’ve mostly seen this piece of news reported and commented on straightforwardly. But it simply doesn’t add up. Either the NSA has a monumental (yet not necessarily surprising) level of bureaucratic bloat that they could feasibly cut that amount of staff regardless of automation, or they are simply going to be less effective once they’ve reduced their staff. I talked with a few people who are intimately familiar with the kind of software that would typically be used for automation of traditional sysadmin tasks (Puppet and Chef). Typically, their products are used to allow an existing group of operations people to do much more, not attempting to do the same amount of work with significantly fewer people. The magical thinking that the NSA can actually put in automation sufficient to do away with 90% of their system administration staff belies some fundamental misunderstandings about automation. I’ll tackle the two biggest ones here.
1. Automation replaces people. Automation is about gaining leverage–it’s about streamlining human tasks that can be handled by computers in order to add mental brainpower. As James Turnbull, former VP of Business Development for PuppetLabs, said to me, “You still need smart people to think about and solve hard problems.” (Whether you agree with the types of problems the NSA is trying to solve is a completely different thing, of course.) In reality, the NSA should have been working on automation regardless of the Snowden affair. It has a massive, complex infrastructure. Deploying a new data center, for example, is a huge undertaking; it’s not something you can automate.
Or as Seth Vargo, who works for OpsCode–the creators of configuration management automation software Chef–puts it, “There’s still decisions to be made. And the machines are going to fail.” Sascha Bates (also with OpsCode) chimed in to point out that “This presumes that system administrators only manage servers.” It’s a naive view. Are the DBAs going away, too? Network administrators? As I mentioned earlier, the NSA has a massive, complicated infrastructure that will always require people to manage it. That plus all the stuff that isn’t (theoretically) being automated will now fall on the remaining 10% who don’t get laid off. And that remaining 10% will still have access to the same information.
2. Automation increases security. Automation increases consistency, which can have a relationship with security. Prior to automating something, you might have a wide variety of people doing the same thing in varying ways, hence with varying outcomes. From a security standpoint, automation provides infrastructure security, and makes it auditable. But it doesn’t really increase data/information security (e.g. this file can/cannot live on that server)–those too are human tasks requiring human judgement. And that’s just the kind of information Snowden got his hands on. This is another example of a government agency over-reacting to a low probability event after the fact. Getting rid of 90% of their sysadmins is the IT equivalent of still requiring airline passengers to take off their shoes and cram their tiny shampoo bottles into plastic baggies; it’s security theater.
There are a few upsides, depending on your perspective on this whole situation. First, if your company is in the market for system administrators, you might want to train your recruiters on D.C. in the near future. Additionally, odds are the NSA is going to be less effective than it is right now. Perhaps, like the CIA, they are also courting Amazon Web Services (AWS) to help run their own private cloud, but again, as Sascha said, managing servers is only a small piece of the system administrator picture.
If you care about or are interested in automation, operations, and security, please join us at Velocity New York on October 14-16. Dr. Nancy Leveson will be delivering a fantastic keynote on security and complex systems.
OSCON 2013 Speaker Series
Automating the configuration management of your operating systems and the rollout of your applications is one of the most important things an administrator or developer can do to avoid surprises when updating services, scaling up, or recovering from failures. However, it’s often not enough. Some of the most common operations that happen in your datacenter (or cloud environment) involve large numbers of machines working together and humans to mediate those processes. While we have been able to remove a lot of human effort from configuration, there has been a lack of software able to handle these higher-level operations.
I used to work for a hosted web application company where the IT process for executing an application update involved locking six people in a room for sometimes 3-4 hours, each person pressing the right buttons at the right time. This process almost always had a glitch somewhere where someone forgot to run the right command or something wasn’t well tested beforehand. While some technical solutions were applied to handle configuration automation, nothing that could perform configuration could really accomplish that high level choreography on top as well. This is why I wrote Ansible.
Ansible is a configuration management, application deployment, and IT orchestration system. One of Ansible’s strong points is having a very simple, human readable language – it allows users very fine, precise control over what happens on what machines at what times.
To get started, create an inventory file, for instance, ~/ansible_hosts that defines what machines you are managing, and which machines are frequently organized into groups. Ansible can also pull inventory from multiple cloud sources, but an inventory file is a quick way to get started:
# add more webservers here
Now that you have defined what machines you are managing, you have to define what you are going to do on the remote machines.
Ansible calls this description of processes a “playbook,” and you don’t have to have just one, you could have different playbooks for different kinds of tasks.
Let’s look at an example for describing a rolling update process. This example is somewhat involved because it’s using haproxy, but haproxy is freely available. Ansible also includes modules for dealing with Netscalers and F5 load balancers, so this is just an example — ordinarily you would start more simply and work up to an example like this:
Velocity 2013 Speaker Series
If you’re a System Administrator, you’re likely all too familiar with the 2:35am PagerDuty alert. “When you roll out testing on your infrastructure,” says Seth Vargo, “the number of alerts drastically decreases because you can build tests right into your Chef cookbooks.” We sat down to discuss his upcoming talk at Velocity, which promises to deliver many more restful nights for SysAdmins.
Key highlights from our discussion include:
- There are not currently any standards regarding testing with Chef. [Discussed at 1:09]
- A recommended workflow that starts with unit testing [Discussed at 2:11]
- Moving cookbooks through a “pipeline” of testing with Test Kitchen [Discussed at 3:11]
- In the event that something bad does make it into production, you can roll back actual infrastructure changes. [Discussed at 4:54]
- Automating testing and cookbook uploads with Jenkins [Discussed at 5:40]
You can watch the full interview here:
Velocity 2013 speaker series
“Puppet and Chef are completely different, and yet exactly the same,” admits Sascha Bates (@sascha_d). In this interview about her talk at the upcoming Velocity Conference, she discusses common pitfalls that people can avoid when getting started with configuration management. And here’s a hint: it isn’t about which tool you choose.
After years in the trenches helping a variety of organizations implement Chef, Sascha learned (often the hard way) a few critical things that she’ll share in her talk. Key points from our discussion include:
- When getting started with configuration management, people often fret over which tool they use, when they should be thinking more about the overall integration with their particular system. [Discussed at the 0:50 mark.]
- Both Chef and Puppet have the concept of a package manager, and if you’re not setting that up properly, things can spiral out of control quickly. [Discussed at the 1:25 mark.]
- Her top configuration management anti-patterns. [Discussed at the 2:43 mark.]
- What superpower someone will have after attending her talk. [Discussed at the 3:55 mark.]
Watch the whole interview here:
This is the first in a series of posts related to the upcoming Velocity Conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.