The secret to successful infrastructure automation is people.
“The trouble with automation is that it often gives us what we don’t need at the cost of what we do.” —Nicholas Carr, The Glass Cage: Automation and Us
Virtualization and cloud hosting platforms have pervasively decoupled infrastructure from its underlying hardware over the past decade. This has led to a massive shift towards what many are calling dynamic infrastructure, wherein infrastructure and the tools and services used to manage it are treated as code, allowing operations teams to adopt software approaches that have dramatically changed how they operate. But with automation comes a great deal of fear, uncertainty and doubt.
Common (mis)perceptions of automation tend to pop up at the extreme ends: it will either liberate your people from ever worrying about mundane tasks and details, running intelligently in the background, or it will make SysAdmins irrelevant and eventually replace all IT jobs (and beyond). Of course, the truth lies somewhere in between, and finding it requires a fundamental rethinking of the relationship between humans and automation.
The daily work of building and deploying complex software.
You are desperate to communicate, to edify or entertain, to preserve moments of grace or joy or transcendence, to make real or imagined events come alive. But you cannot will this to happen. It is a matter of persistence and faith and hard work. So you might as well just go ahead and get started. — Anne Lamott, Bird by Bird
Words like ‘persistence’ and ‘work’ rarely show up in the same sentence as DevOps. It is more likely to be characterized as that One Weird Trick that will suddenly make your entire software development and deployment pipeline work faster and without failure every time. There are DevOps consultants and entrepreneurs — people and companies promising jetpacks and hovercrafts delivered on schedule and with cost savings. This is the software equivalent of the legendary savant writer struck by divine genius, churning out perfect page after page without fail and swimming in millions from her bestselling novels.
In reality, DevOps is quite similar to writing — it requires concerted, daily effort — but instead of sitting down to write every day, the principles look something more like:
- Release regularly
- Release in small chunks
- Test in production (or, less provocatively: do not expect to release something perfect all the time)
- Collect system/performance data and user feedback
- Refine, optimize, repeat
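The loop above can be sketched in code. This is a toy illustration of the release-small-and-watch-the-data cycle, not any particular tool's API; `ReleaseLoop`, `deploy!`, `error_rate`, and the 1% error budget are all invented for the example.

```ruby
# A minimal sketch of the DevOps feedback loop: ship one small change,
# collect a health metric from production, and refine (here: roll back)
# if it regresses. All names are hypothetical, not a real API.
class ReleaseLoop
  def initialize(deployer, metrics)
    @deployer = deployer   # must respond to deploy!(change) and rollback!
    @metrics  = metrics    # must respond to error_rate -> Float
  end

  # Release one small chunk; keep it if the observed error rate stays
  # within budget, otherwise roll back. Returns :kept or :rolled_back.
  def release(change, error_budget: 0.01)
    @deployer.deploy!(change)
    if @metrics.error_rate > error_budget
      @deployer.rollback!
      :rolled_back
    else
      :kept
    end
  end
end
```

The point of the sketch is the shape, not the code: every release is small enough that a rollback is cheap, and the decision to keep or revert is driven by production data rather than by the hope that the release was perfect.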
A "Coded Business" harnesses feedback loops, optimization, ubiquitous delivery, and other web-centric methods.
Seven years ago, Steve Souders and Jesse Robbins came to the realization that they both worked within “tribes” that, while ostensibly quite different, were talking about many of the same things. Front-end developers and engineers were figuring out how to make web pages faster and more reliable, and web operations folks were making deployments faster and more resilient.
And so goes the tale of how Velocity came to be — a conference that brought those tribes together and openly shared how to make the web faster and stronger. In those seven years, quite a lot has changed, and many ideas, terms, and technologies have come into being — some directly as a result of Velocity, others already in the works. DevOps, Chef, Puppet, Continuous Delivery, HTTP Archive — these were the earlier forays. Soon to follow were AWS, Application Performance Monitoring (APM) products, many more monitoring tools, many more CDN vendors, WebPageTest, the explosion of the cloud, Chaos Monkey, mobile everything, Vagrant, Docker, and much, much more.
Out of the fire of Velocity came a New Way of doing things forged in a web-centric world. Along the way, something changed fundamentally about not just tech companies, but companies in general. As we looked around more recently, we realized it wasn’t just about the web and fast pages any more.
How do we manage systems that are too large to understand, too complex to control, and that fail in unpredictable ways?
“What is surprising is not that there are so many accidents. It is that there are so few. The thing that amazes you is not that your system goes down sometimes, it’s that it is up at all.”—Richard Cook
In September 2007, Jean Bookout, 76, was driving her Toyota Camry down an unfamiliar road in Oklahoma, with her friend Barbara Schwarz seated next to her on the passenger side. Suddenly, the Camry began to accelerate on its own. Bookout tried hitting the brakes and applying the emergency brake, but the car continued to accelerate. The car eventually collided with an embankment, injuring Bookout and killing Schwarz. In a subsequent legal case, lawyers for Toyota pointed to the most common of culprits in these types of accidents: human error. “Sometimes people make mistakes while driving their cars,” one of the lawyers claimed. Bookout was older, the road was unfamiliar; these tragic things happen.
Web design trends often carry hefty performance costs
Web and mobile users continue to expect faster sites and apps, especially on mobile, and this year I’d like to see people who work on the web spend more time treating performance as a user experience priority instead of chasing trends.
I recently ran across this article in Forbes, which lists a number of web design goals/trends that Steve Cooper is eyeing for a redesign of the online magazine Hitched. My intention is not to pick on Hitched or Cooper per se, but the list is a Molotov cocktail of potential performance woes:
- Continuous scrolling
- Responsive design
- Parallax sites
You can use most of those techniques without creating performance nightmares, but doing so is unfortunately rare. I feel like I’m living in an alternate reality where I’m hearing that users want simpler, faster sites, and yet the trends in web design are marching in the opposite direction.
Velocity 2013 Speaker Series
I want to start by thanking John and Steve for the warm welcome. They’ve created something amazing with Velocity, and I’m excited to be a part of it.
It might seem a bit odd to talk about What’s Next at the beginning of a conference, but I figure the best time to go to the bank and ask for a loan is when you actually have some money.
What we’ve been talking about at Velocity, especially the DevOps side of things, is only the tip of the iceberg when it comes to how businesses are changing. And that shift is from the sequential to the concurrent. It used to be that we threw things over a series of walls, from Product Management to Design, to Development, to QA, to Production, to Customer Service and so on. That was an old world of software and one-year development cycles.
Why the Velocity conference is coming to New York.
In October, we’re bringing our Velocity conference to New York for the first time. Let’s face it, a company expanding its conference to other locations isn’t anything that unique. And given the thriving startup scene in New York, there’s no real surprise we’d like to have a presence there, either. In that sense, we’ll be doing what we’ve already been doing for years with the Velocity conference in California: sharing expert knowledge about the skills and technologies that are critical for building scalable, resilient, high-availability websites and services.
But there’s an even more compelling reason we’re looking to New York: the finance industry. We’d be foolish and remiss to act like it didn’t factor into our decision, or like we don’t share some common concerns, especially on the operational side of things. The Velocity community spends a great deal of time navigating significant operational realities — infrastructure, cost, risk, failures, resiliency; we have a great deal to share with people working in finance, and I’d wager, a great deal to learn in return. If Google or Amazon go down, they lose money. (I’m not saying this is a good thing, mind you.) When a “technical glitch” occurs in financial service systems, we get flash crashes, a complete suspension of the Nasdaq, and whatever else comes next — all with potentially catastrophic outcomes.
The NSA Can't Replace 90% of Its System Administrators
In the aftermath of Edward Snowden’s revelations about the NSA’s domestic surveillance activities, the agency has announced that it plans to get rid of 90% of its system administrators via software automation in order to “improve security.” So far, I’ve mostly seen this piece of news reported and commented on straightforwardly. But it simply doesn’t add up. Either the NSA has a monumental (yet not necessarily surprising) level of bureaucratic bloat that would let it cut that much staff regardless of automation, or it is simply going to be less effective once the staff is gone. I talked with a few people who are intimately familiar with the kind of software typically used to automate traditional sysadmin tasks (Puppet and Chef). These products are typically used to let an existing group of operations people do much more, not to do the same amount of work with significantly fewer people. The magical thinking that the NSA can put enough automation in place to do away with 90% of its system administration staff betrays some fundamental misunderstandings about automation. I’ll tackle the two biggest ones here.
1. Automation replaces people. Automation is about gaining leverage: streamlining the human tasks that computers can handle in order to free up brainpower for harder problems. As James Turnbull, former VP of Business Development for Puppet Labs, said to me, “You still need smart people to think about and solve hard problems.” (Whether you agree with the types of problems the NSA is trying to solve is a completely different thing, of course.) In reality, the NSA should have been working on automation regardless of the Snowden affair. It has a massive, complex infrastructure. Deploying a new data center, for example, is a huge undertaking; it’s not something you can simply automate away.
Or as Seth Vargo, who works for Opscode (the creators of the configuration management software Chef), puts it, “There’s still decisions to be made. And the machines are going to fail.” Sascha Bates (also with Opscode) chimed in to point out that “This presumes that system administrators only manage servers.” That presumption is naive. Are the DBAs going away, too? Network administrators? As I mentioned earlier, the NSA has a massive, complicated infrastructure that will always require people to manage it. That work, plus everything that isn’t (theoretically) being automated, will now fall on the remaining 10% who don’t get laid off. And that remaining 10% will still have access to the same information.
2. Automation increases security. Automation increases consistency, which can have a relationship with security. Before you automate something, you might have a wide variety of people doing the same task in varying ways, and hence with varying outcomes. From a security standpoint, automation provides infrastructure security and makes it auditable. But it doesn’t really increase data/information security (e.g., whether this file can or cannot live on that server); those are still human tasks requiring human judgment. And that’s just the kind of information Snowden got his hands on. This is another example of a government agency overreacting to a low-probability event after the fact. Getting rid of 90% of their sysadmins is the IT equivalent of still requiring airline passengers to take off their shoes and cram their tiny shampoo bottles into plastic baggies; it’s security theater.
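To illustrate the consistency point: a configuration-management run converges every node to one declared, auditable state, instead of each admin hand-editing files in their own way. Here is a minimal Chef-style recipe fragment; the resources follow Chef’s DSL, but the file, template, and service names are illustrative, not taken from any real cookbook.

```ruby
# Every node running this recipe ends up with the same sshd_config,
# owned by root with mode 0600, and the change is recorded in the
# converge log — consistent and auditable, but the *policy* inside
# sshd_config.erb is still a human decision.
template '/etc/ssh/sshd_config' do
  source 'sshd_config.erb'
  owner  'root'
  group  'root'
  mode   '0600'
  notifies :restart, 'service[sshd]'
end

service 'sshd' do
  action [:enable, :start]
end
```

Note what the automation does and does not buy you here: the mechanics of applying the configuration are now uniform, but deciding what belongs in that configuration, and what data belongs on that server, remains human work.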
There are a few upsides, depending on your perspective on this whole situation. First, if your company is in the market for system administrators, you might want to train your recruiters on D.C. in the near future. Additionally, odds are the NSA is going to be less effective than it is right now. Perhaps, like the CIA, they are also courting Amazon Web Services (AWS) to help run their own private cloud, but again, as Sascha said, managing servers is only a small piece of the system administrator picture.
If you care about or are interested in automation, operations, and security, please join us at Velocity New York on October 14-16. Dr. Nancy Leveson will be delivering a fantastic keynote on security and complex systems.
Failure is a Feature
The Santa Clara edition of our Velocity conference wrapped up a little over a week ago, and I’ve had a chance to reflect on the formal talks and excellent hallway conversations I had throughout. Here are a few themes I saw, including a few of the standout talks:
1. Velocity continues to grow. I had to qualify that I’d been to the Santa Clara conference because Velocity now runs in three more locations annually, starting with China and Europe last year, and adding the newest location this year: New York in October. I’m excited to see what new perspectives this will bring, most notably from the financial industry.
3. Perception matters (and page load time doesn’t measure it). Quite a few talks hit on the idea of getting the most critical information in front of people first, and letting the rest load after. (Steve Souders gave a really great Ignite talk on this as well.) And with single-page apps, the very concept of page load goes out the window (pun intended) almost entirely. My favorite talk on this front was Rachel Myers and Emily Nakashima’s case study of work they’d done (previously) at ModCloth. The bottom line: feature load time was a far more useful performance metric for them–and their management team–when it came to the single-page application they’d built. They’d cobbled their own solution together using Google Analytics and Circonus to track feature load time, but it looks like the new product announced at Velocity from New Relic might just provide that out of the box now. Their presentation also had ostriches and yaks for a little extra awesome.
4. Failure is a feature (and you should plan for it at all levels of your organization and products). The opening keynote from Johan Bergstrom provided a fascinating perspective on risk in complex systems (e.g., web operations). While he didn’t provide any concrete ways to assess your own risk (and that was part of the point), what I took away from it was this: if you’re assessing your risk as a function of the severity and probability of technical components of your system going down (e.g., are they “reliable”?), you’re missing a key piece of the picture. Organizations need to factor in humans as some of those components (or “actors”), and look at how a complex system functions via the interdependencies and relationships between actors. Those interactions are constantly, dynamically changing, and risk is a product of all of them together. (For more reading on this, I highly suggest some of Johan’s references in his blog post about the talk, notably Sidney Dekker’s work.)
Dylan Richard also gave a fantastic keynote about the gameday scenarios he ran during the Obama campaign. The bottom line: Plan for failure. Design your apps and your team to be able to handle it when it happens.
5. A revolution is coming (and there be dinosaurs). Whither Circuit City and Blockbuster? They didn’t just get eaten by Best Buy and Netflix at random; they failed to see the writing on IT’s wall. With transformative technologies like the cloud and infrastructure automation, the backend is not so back-room any longer. And performance isn’t just about the speed of your site or app. Adam Jacob gave a talk at the very end of the conference (which Jesse Robbins reprised the next day at DevOpsDays) that was a rallying cry for people in IT and Operations: you control the destiny of your organization. It oversimplified many things, in my opinion, but the core message was there, and it’s something we’ve been saying at O’Reilly for a little while now, too: every business is now an Internet business. The dinosaurs will be those who, in Adam’s words, fail to “leverage digital commerce to rapidly deliver goods and services to consumers.” In other words: transform or die.
You can see all the keynotes, plus interviews and other related Velocity video goodness on our YouTube channel. You can also purchase the complete video compilation that includes all the tutorials and sessions, as well.
Velocity 2013 Speaker Series
If you’re a System Administrator, you’re likely all too familiar with the 2:35am PagerDuty alert. “When you roll out testing on your infrastructure,” says Seth Vargo, “the number of alerts drastically decreases because you can build tests right into your Chef cookbooks.” We sat down to discuss his upcoming talk at Velocity, which promises to deliver many more restful nights for SysAdmins.
Key highlights from our discussion include:
- There are currently no standards for testing with Chef [Discussed at 1:09]
- A recommended workflow that starts with unit testing [Discussed at 2:11]
- Moving cookbooks through a “pipeline” of testing with Test Kitchen [Discussed at 3:11]
- Rolling back actual infrastructure changes if something bad does make it into production [Discussed at 4:54]
- Automating testing and cookbook uploads with Jenkins [Discussed at 5:40]
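As a sketch of what the unit-testing step in that workflow can look like, here is a minimal ChefSpec example. ChefSpec is a real RSpec-based library for unit-testing Chef cookbooks (it requires the chefspec gem), but the cookbook, package, and service names below are invented for illustration.

```ruby
# Hypothetical spec for an invented "mycookbook::default" recipe:
# ChefSpec converges the recipe in memory (no real machine is touched)
# and asserts on the resources the run would create.
require 'chefspec'

describe 'mycookbook::default' do
  let(:chef_run) { ChefSpec::SoloRunner.converge(described_recipe) }

  it 'installs the ntp package' do
    expect(chef_run).to install_package('ntp')
  end

  it 'enables and starts the ntp service' do
    expect(chef_run).to enable_service('ntp')
    expect(chef_run).to start_service('ntp')
  end
end
```

Because the converge is simulated, specs like this run in seconds and can gate every cookbook change in a Jenkins job, which is how the “pipeline” in the interview catches mistakes long before a 2:35am page.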
You can watch the full interview here: