Courtney Nash

Everything is distributed

How do we manage systems that are too large to understand, too complex to control, and that fail in unpredictable ways?

Complexity

“What is surprising is not that there are so many accidents. It is that there are so few. The thing that amazes you is not that your system goes down sometimes, it’s that it is up at all.”—Richard Cook

In September 2007, Jean Bookout, 76, was driving her Toyota Camry down an unfamiliar road in Oklahoma, with her friend Barbara Schwarz in the passenger seat. Suddenly, the Camry began to accelerate on its own. Bookout tried the brakes, then the emergency brake, but the car continued to accelerate. It eventually collided with an embankment, injuring Bookout and killing Schwarz. In a subsequent legal case, lawyers for Toyota pointed to the most common culprit in these types of accidents: human error. “Sometimes people make mistakes while driving their cars,” one of the lawyers claimed. Bookout was older, the road was unfamiliar, these tragic things happen. Read more…

Comments: 5

The altar of shiny

Web design trends often carry hefty performance costs

Web and mobile users continue to expect faster sites and apps–especially when it comes to mobile–and this year I’d like to see people who work on the web spend more time focusing on performance as a user experience priority instead of chasing trends.

I recently ran across this article in Forbes, which lists a number of web design goals/trends that Steve Cooper is eyeing for a redesign of the online magazine Hitched. My intention is not to pick on Hitched or Cooper per se, but the list is a Molotov cocktail of potential performance woes:

  • Continuous scrolling
  • Responsive design
  • Parallax sites

You can use most of those techniques without creating performance nightmares, but doing so is unfortunately rare. I feel like I’m living in an alternate reality where I’m hearing that users want simpler, faster sites, and yet the trends in web design are marching in the opposite direction.
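The article doesn’t prescribe specific fixes, but to make the point concrete: one common way to keep an image-heavy, continuously scrolling page from front-loading every asset is to lazy-load offscreen images. Below is a minimal TypeScript sketch using the browser’s IntersectionObserver API; the data-src attribute and the .lazy class are illustrative conventions, not anything from the article.

    // Lazy-load images marked with a data-src attribute so a long,
    // continuously scrolling page doesn't fetch every image up front.
    // (data-src and the .lazy class are illustrative conventions.)
    const lazyObserver = new IntersectionObserver((entries, observer) => {
      for (const entry of entries) {
        if (!entry.isIntersecting) continue;
        const img = entry.target as HTMLImageElement;
        if (img.dataset.src) {
          img.src = img.dataset.src;     // trigger the actual download
          img.removeAttribute("data-src");
        }
        observer.unobserve(img);         // each image only needs loading once
      }
    }, { rootMargin: "200px" });         // start fetching just before it scrolls into view

    document.querySelectorAll<HTMLImageElement>("img.lazy")
      .forEach((img) => lazyObserver.observe(img));

The same idea applies to parallax backgrounds and continuously appended content: defer anything the user hasn’t scrolled to yet.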

Read more…

Comments: 3

Velocity: Toward the real-time business

Velocity 2013 Speaker Series

I want to start by thanking John and Steve for the warm welcome. They’ve created something truly amazing with Velocity, and I’m excited to be a part of it.

It might seem a bit odd to talk about What’s Next at the beginning of a conference, but I figure the best time to go to the bank and ask for a loan is when you actually have some money.

What we’ve been talking about at Velocity, especially the DevOps side of things, is only the tip of the iceberg when it comes to how businesses are changing. And that shift is from the sequential to the concurrent. It used to be that we threw things over a series of walls, from Product Management to Design, to Development, to QA, to Production, to Customer Service and so on. That was an old world of software and one-year development cycles.

Read more…

Comment

Sharing is a competitive advantage

Why the Velocity conference is coming to New York.

In October, we’re bringing our Velocity conference to New York for the first time. Let’s face it, a company expanding its conference to other locations isn’t anything unique. And given the thriving startup scene in New York, it’s no real surprise we’d like to have a presence there, either. In that sense, we’ll be doing what we’ve already been doing for years with the Velocity conference in California: sharing expert knowledge about the skills and technologies that are critical for building scalable, resilient, high-availability websites and services.

But there’s an even more compelling reason we’re looking to New York: the finance industry. We’d be foolish and remiss to act like it didn’t factor into our decision, or like we didn’t share some common concerns, especially on the operational side of things. The Velocity community spends a great deal of time navigating significant operational realities — infrastructure, cost, risk, failures, resiliency; we have a great deal to share with people working in finance and, I’d wager, a great deal to learn in return. If Google or Amazon goes down, it loses money. (I’m not saying this is a good thing, mind you.) When a “technical glitch” occurs in financial service systems, we get flash crashes, a complete suspension of the Nasdaq, and whatever else comes next — all with potentially catastrophic outcomes.

Read more…

Comment

Automation Myths

The NSA Can't Replace 90% of Its System Administrators

In the aftermath of Edward Snowden’s revelations about the NSA’s domestic surveillance activities, the NSA recently announced that it plans to get rid of 90% of its system administrators via software automation in order to “improve security.” So far, I’ve mostly seen this piece of news reported and commented on straightforwardly. But it simply doesn’t add up. Either the NSA has such a monumental (yet not necessarily surprising) level of bureaucratic bloat that it could feasibly cut that much staff regardless of automation, or it is simply going to be less effective once it has reduced its staff. I talked with a few people who are intimately familiar with the kind of software that would typically be used to automate traditional sysadmin tasks (Puppet and Chef). These tools are typically used to let an existing group of operations people do much more, not to do the same amount of work with significantly fewer people. The magical thinking that the NSA can put enough automation in place to do away with 90% of its system administration staff reveals some fundamental misunderstandings about automation. I’ll tackle the two biggest ones here.

1. Automation replaces people. Automation is about gaining leverage: it hands off to computers the tasks they can handle in order to free up human brainpower. As James Turnbull, former VP of Business Development for Puppet Labs, said to me, “You still need smart people to think about and solve hard problems.” (Whether you agree with the types of problems the NSA is trying to solve is a completely different thing, of course.) In reality, the NSA should have been working on automation regardless of the Snowden affair. It has a massive, complex infrastructure. Deploying a new data center, for example, is a huge undertaking; it’s not something you can automate.

Or as Seth Vargo, who works for Opscode–the creators of the configuration management software Chef–puts it, “There’s still decisions to be made. And the machines are going to fail.” Sascha Bates (also with Opscode) chimed in to point out that “This presumes that system administrators only manage servers.” It’s a naive view. Are the DBAs going away, too? Network administrators? As I mentioned earlier, the NSA has a massive, complicated infrastructure that will always require people to manage it. That, plus all the stuff that isn’t (theoretically) being automated, will now fall on the remaining 10% who don’t get laid off. And that remaining 10% will still have access to the same information.

2. Automation increases security. Automation increases consistency, which can contribute to security. Prior to automating something, you might have a wide variety of people doing the same thing in varying ways, and hence with varying outcomes. From a security standpoint, automation provides infrastructure security and makes it auditable. But it doesn’t really increase data or information security (e.g., whether this file can or cannot live on that server)–those are still human tasks requiring human judgment. And that’s just the kind of information Snowden got his hands on. This is another example of a government agency overreacting to a low-probability event after the fact. Getting rid of 90% of its sysadmins is the IT equivalent of still requiring airline passengers to take off their shoes and cram their tiny shampoo bottles into plastic baggies; it’s security theater.

There are a few upsides, depending on your perspective on this whole situation. First, if your company is in the market for system administrators, you might want to train your recruiters on D.C. in the near future. Additionally, odds are the NSA is going to be less effective than it is right now. Perhaps, like the CIA, they are also courting Amazon Web Services (AWS) to help run their own private cloud, but again, as Sascha said, managing servers is only a small piece of the system administrator picture.

If you care about or are interested in automation, operations, and security, please join us at Velocity New York on October 14-16. Dr. Nancy Leveson will be delivering a fantastic keynote on security and complex systems.

Comments: 22

Velocity CA Recap

Failure is a Feature

The Santa Clara edition of our Velocity conference wrapped up a little over a week ago, and I’ve had a chance to reflect on the formal talks and excellent hallway conversations I had throughout. Here are a few themes I saw, including a few of the standout talks:

1. Velocity continues to grow. I had to qualify that I’d been to the Santa Clara conference, because it’s now cropped up in three more locations annually, starting with China and Europe last year, and moving to the newest location this year: New York in October. I’m excited to see what new perspectives this will bring, most notably on the financial industry side of things.

2. The web is getting faster (barely). Steve Souders mentioned this in his keynote at the HTML5 Developer Conference (with a related writeup here), and Tammy Everts’ excellent summary of her experience at Velocity provided a slightly depressing list of things that people still aren’t really doing, or are struggling with: third-party content, images, caching, web fonts and… JavaScript. Along with Tammy, I also noticed some varying opinions about how helpful Responsive Web Design really is when it comes to mobile performance. As mobile usage continues to grow dramatically (and the greatest growth is in the developing world, on slow cellular networks and basic devices), these pain points are only multiplied on processors optimized for battery consumption rather than CPU performance. The highlight on this front for me was Ilya Grigorik’s talk on Optimizing the Critical Rendering Path for Instant Mobile Websites (note: I’m wholly biased here, as I’m editing his soon-to-be-released book, High Performance Browser Networking).
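Grigorik’s critical rendering path argument boils down to keeping render-blocking resources, especially JavaScript, out of the way of first paint. As a rough illustration (this is my own sketch, not something from his talk, and the script URL is a placeholder), here’s one common pattern in TypeScript: inject non-critical scripts only after the page has loaded.

    // Inject a non-critical script after the load event so it never blocks
    // the critical rendering path. The URL below is just a placeholder.
    function loadDeferredScript(src: string): void {
      const script = document.createElement("script");
      script.src = src;
      script.async = true;   // dynamically injected scripts are async by default; set here for clarity
      document.head.appendChild(script);
    }

    // Analytics, social widgets, and other third-party content can usually
    // wait until after first render.
    window.addEventListener("load", () => {
      loadDeferredScript("/js/third-party-widgets.js");
    });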

3. Perception matters (and page load time doesn’t measure it). Quite a few talks hit on the idea of getting the most critical information in front of people first, and letting the rest load after. (Steve Souders gave a really great Ignite talk on this as well.) And with single-page apps, the very concept of page load goes out the window (pun intended) almost entirely. My favorite talk on this front was Rachel Myers and Emily Nakashima’s case study of work they’d previously done at ModCloth. The bottom line: feature load time was a far more useful performance metric for them–and their management team–when it came to the single-page application they’d built. They’d cobbled their own solution together using Google Analytics and Circonus to track feature load time, but it looks like the new product New Relic announced at Velocity might just provide that out of the box now. Their presentation also had ostriches and yaks for a little extra awesome.
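Their exact Google Analytics and Circonus setup isn’t described here, but the underlying idea, timing individual features rather than whole pages, maps cleanly onto the browser’s User Timing API. Here’s a hedged TypeScript sketch; the feature name and the /perf-beacon endpoint are placeholders I’ve invented, not part of ModCloth’s actual system.

    // Measure "feature load time" for one feature of a single-page app
    // using the User Timing API, then report it. The feature name and the
    // /perf-beacon endpoint are placeholders.
    function featureStart(feature: string): void {
      performance.mark(`${feature}:start`);
    }

    function featureReady(feature: string): void {
      performance.mark(`${feature}:ready`);
      performance.measure(feature, `${feature}:start`, `${feature}:ready`);

      const [measure] = performance.getEntriesByName(feature, "measure");
      if (measure) {
        // Ship the duration to whatever backend you graph and alert on.
        navigator.sendBeacon(
          "/perf-beacon",
          JSON.stringify({ feature, duration: measure.duration })
        );
      }
    }

    // Usage: bracket the lifecycle of one feature, e.g. a product grid.
    featureStart("product-grid");
    // ... fetch data and render the feature ...
    featureReady("product-grid");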

4. Failure is a feature (and you should plan for it at all levels of your organization and products). The opening keynote from Johan Bergstrom provided a fascinating perspective on risk in complex systems (e.g., web operations). While he didn’t provide any concrete ways to assess your own risk (and that was part of the point), what I took away from it was this: if you’re assessing your risk as a function of the severity and probability of technical components of your system going down (e.g., whether they are “reliable”), you’re missing a key piece of the picture. Organizations need to factor in humans as some of those components (or “actors”), and look at how a complex system functions via the interdependencies and relationships between actors. The system is constantly, dynamically changing, and risk is a product of all the interactions within it. (For more reading on this, I highly suggest some of Johan’s references in his blog post about the talk, notably Sidney Dekker’s work.)

Dylan Richard also gave a fantastic keynote about the gameday scenarios he ran during the Obama campaign. The bottom line: Plan for failure. Design your apps and your team to be able to handle it when it happens.

5. A revolution is coming (and there be dinosaurs). Whither Circuit City and Blockbuster? They didn’t just get eaten by Best Buy and Netflix at random–they failed to see the writing on IT’s wall. With transformative technologies like the cloud and infrastructure automation, the backend is not so back-room any longer. And performance isn’t just about the speed of your site or app. Adam Jacob gave a talk at the very end of the conference (which Jesse Robbins reprised the next day at DevOpsDays) that was a rallying cry for people in IT and Operations: you control the destiny of your organization. It oversimplified many things, in my opinion, but the core message was there, and it’s something we’ve been saying at O’Reilly for a little while now, too: every business is now an Internet business. The dinosaurs will be those who, in Adam’s words, fail to “leverage digital commerce to rapidly deliver goods and services to consumers.” In other words, transform or die.

You can see all the keynotes, plus interviews and other related Velocity video goodness, on our YouTube channel. You can also purchase the complete video compilation, which includes all the tutorials and sessions.

 

Comment

Test-driven Infrastructure with Chef

Velocity 2013 Speaker Series

If you’re a System Administrator, you’re likely all too familiar with the 2:35am PagerDuty alert. “When you roll out testing on your infrastructure,” says Seth Vargo, “the number of alerts drastically decreases because you can build tests right into your Chef cookbooks.” We sat down to discuss his upcoming talk at Velocity, which promises to deliver many more restful nights for SysAdmins.

Key highlights from our discussion include:

  • There are not currently any standards regarding testing with Chef [Discussed at 1:09]
  • A recommended workflow that starts with unit testing [Discussed at 2:11]
  • Moving cookbooks through a “pipeline” of testing with Test Kitchen [Discussed at 3:11]
  • In the event that something bad does make it into production, you can roll back actual infrastructure changes [Discussed at 4:54]
  • Automating testing and cookbook uploads with Jenkins [Discussed at 5:40]

You can watch the full interview here:

 

Comment

Ops Mythology

Velocity 2013 Speaker Series

At some point, we’ve all ended up trading horror stories over drinks with colleagues. Heads nod and shake in sympathy, and the stories get hairier as the night goes on. And while it of course feels good to get some of that dirt off your shoulder, is there a larger, better purpose to sharing war stories? I sat down with James Turnbull of Puppet Labs (@kartar) to chat about his upcoming Velocity talk about Ops mythology, and how we might be able to turn our tales of disaster into triumph.

Key highlights of our discussion include:

  • Why do we share disaster stories? What is the attraction? [Discussed at 0:40]
  • Stories are about shared experience and bonding with members of our community. [Discussed at 2:10]
  • These horror stories are like mythological “big warnings” that help enforce social order, which isn’t always a good thing. [Discussed at 4:18]
  • A preview of his talk, which is about moving away from the bad stories so people can keep telling good ones. (Also: s’mores.) [Discussed at 7:15]

You can watch the entire interview here:

This is one of a series of posts related to the upcoming Velocity conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.

Comment

Google Glass: What Developers Need to Know about This New Platform

Creating Glassware today and what's in store for tomorrow

You’ve likely already seen pictures of people using Google Glass, if not spotted one in the wild yourself. After getting a quick demo myself, I spoke with Maximiliano Firtman about his talk at the Fluent conference, which covers what developers need to start doing and thinking about when it comes to building apps for this new environment.

Key highlights include:

  • The current version supports cloud-based web applications that can be built in any language using the Mirror API (see the sketch after this list). [Discussed at 0:30]
  • A forthcoming SDK will support native app development, essentially Android apps written in Java. [Discussed at 2:20]
  • The only truly augmented reality type application currently available is Google Maps. [Discussed at 3:30]
  • Developers need to think outside the technical details as well, and spend time considering how people will be interacting with Google Glass—it’s a uniquely new paradigm with unique use cases. [Discussed at 4:14]
  • While the beta (Explorer) program is currently closed, Max expects to see more devices available and “on the street” within the next year. [Discussed at 6:10]
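To make the Mirror API item above concrete, here is a rough TypeScript sketch of its basic operation: inserting a text card into a user’s timeline. It assumes the standard Mirror API v1 REST endpoint and an already-obtained OAuth 2.0 access token with the Glass timeline scope; the access token and card text are placeholders.

    // Insert a simple text card into a Glass user's timeline via the
    // Mirror API REST endpoint. Assumes an OAuth 2.0 access token with
    // the timeline scope has already been obtained; accessToken and the
    // card text are placeholders.
    async function insertTimelineCard(accessToken: string, text: string): Promise<void> {
      const response = await fetch("https://www.googleapis.com/mirror/v1/timeline", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${accessToken}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ text }),
      });

      if (!response.ok) {
        throw new Error(`Mirror API request failed with status ${response.status}`);
      }
    }

    // Usage (placeholder token):
    // await insertTimelineCard(myAccessToken, "Hello from my Glassware!");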

You can view the full interview here:

Read more…

Comment