"velocityconf" entries

What Is the Risk That Amazon Will Go Down (Again)?

Velocity 2013 Speaker Series

Why should we bother at all with notions such as risk and safety in web operations? Do web operations face risk? Do web operations manage risk? Do web operations produce risk? Last Christmas Eve, Amazon had an AWS outage that affected a variety of actors, including Netflix, a service included in many of the gifts shared on that very day. The event introduced the notion of risk into the discourse of web operations, so it may be good timing for some reflective thoughts on the very nature of risk in this domain.

What is risk? The question is a classic one, and the answer is tightly coupled to how one views the nature of the incident occurring as a result of the risk.

One approach to assessing the risk of Amazon going down is probabilistic: start by laying out the entire space of potential scenarios leading to Amazon going down, calculate the probability of each, and multiply each scenario's probability by its estimated severity (likely expressed as the cost of that specific scenario, which depends on the timing of the event). Each scenario can then be plotted in a risk matrix showing its weighted ranking (to prioritize future risk mitigation measures), or the scenarios can be summed into a collective risk figure (to judge whether the risk of Amazon going down is below a certain acceptance criterion).
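To make the arithmetic concrete, here is a minimal sketch (in JavaScript, with entirely invented scenario names, probabilities, and costs) of the weighted ranking and the collective sum described above:

```javascript
// Hypothetical outage scenarios: annual probability and estimated
// severity (cost) if the scenario occurs. All numbers are invented.
const scenarios = [
  { name: 'data center power failure', probability: 0.02, severity: 5e6 },
  { name: 'fiber cut',                 probability: 0.05, severity: 1e6 },
  { name: 'operator error',            probability: 0.10, severity: 2e6 },
];

// Weighted risk per scenario, e.g. for ranking in a risk matrix.
scenarios.forEach(function (s) {
  console.log(s.name + ': ' + s.probability * s.severity);
});

// Collective sum, to compare against an acceptance criterion.
const totalRisk = scenarios.reduce(function (sum, s) {
  return sum + s.probability * s.severity;
}, 0);
console.log('total expected cost of going down: ' + totalRisk);
```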

This first way of answering the question of what the risk is for Amazon to go down is intimately linked to a perception of risk as energy to be kept contained (Haddon, 1980). This view originates in the era of rapidly developing process industries, in which clearly graspable energies (the fuel rods of a nuclear plant, the fossil fuels at a refinery, the kinetic energy of an aircraft) are to be kept contained and safely separated from vulnerable targets such as human beings. The important question then becomes how to avoid an uncontrolled release of the contained energy. There are basically two strategies for mitigating that risk: barriers and redundancy (and the two combined: redundancy of barriers). Physically graspable energies can be contained through the use of multiple barriers (called "defenses in depth") and potentially several barriers of the same kind (redundancy), for instance several emergency-cooling systems at a nuclear plant.

Using this metaphor, the risk of Amazon going down is mitigated by building a system of redundant barriers (multiple data centers, backups, active fire suppression, etc.). This might seem like a tidy solution, but here we run into two problems with the probabilistic approach to risk: its view of the human operating the system, and the increased complexity that results from introducing more and more barriers.

Controlling risk by analyzing the complete space of possible (and graspable) scenarios basically does not distinguish between safety and reliability. From this view, a system is safe when it is reliable, and the reliability of each barrier can be calculated. However, there is one system component that is more difficult to grasp in terms of reliability than any other: the human. Inevitably, proponents of the energy/barrier model of risk end up explaining incidents (typically accidents) in terms of unreliable human beings failing to guarantee the safety (reliability) of an inherently safe (risk controlled by reliable barriers) system. I think this problem, which has an entire literature of its own, is too big to outline in further detail in this blog post, but let me point you towards a few references: Dekker, 2005; Dekker, 2006; Woods, Dekker, Cook, Johannesen & Sarter, 2009. The only catch is that these (and most other citations in this post) are academic tomes, so for those who would prefer a shorter summary available online, I can refer you to this report. I can also reassure you that I will get back to this issue in my keynote at the Velocity conference next month. To put the critique briefly: the contemporary literature questions the view of humans as the unreliable component of inherently safe systems, and instead advocates a view of humans as the only ones able to guarantee safety in inherently complex and risky environments.

End-to-End JavaScript Quality Analysis

Velocity 2013 Speaker Series

The rise of single-page web applications means that front-end developers need to pay attention not only to network transport optimization, but also to rendering and computation performance. For applications written in JavaScript, the language tooling itself has not really caught up with the demand for the richer, more varied performance metrics such a development workflow needs. Fortunately, some emerging tools can serve as a stop-gap measure until the browser itself provides native support for those metrics. I'll be covering a number of them in my talk at Velocity next month, but here's a quick sneak preview of a few.

Code coverage

One important building block for analyzing overall single-page application performance is instrumentation of the application code. The most obvious use case is analyzing code coverage, particularly when running unit tests and functional tests. Code that never gets executed during the testing process is an accident waiting to happen. While it may be unreasonable to demand 100% coverage, having no coverage data at all does not provide a lot of confidence. These days, easy-to-use coverage tools such as Istanbul and Blanket.js are becoming widespread, and they work seamlessly with popular test frameworks and runners such as Jasmine, Mocha, Karma, and many others.
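As a tiny illustration (the module and test below are hypothetical, not from any real project), consider a Mocha test that exercises only one branch of a function; a coverage tool such as Istanbul or Blanket.js would flag the other branch as untested:

```javascript
// contact.js -- hypothetical module under test
function formatName(contact) {
  if (contact.nickname) {
    return contact.nickname;                    // exercised by the test below
  }
  return contact.first + ' ' + contact.last;    // never executed during testing
}
module.exports = formatName;

// test/contact.test.js -- Mocha test touching only one branch
const assert = require('assert');
const formatName = require('../contact');

describe('formatName', function () {
  it('prefers the nickname when present', function () {
    assert.strictEqual(formatName({ nickname: 'Ari' }), 'Ari');
  });
  // There is no test for a contact without a nickname, so the second
  // branch shows up as an uncovered line in the coverage report.
});
```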

Complexity

Instrumented code can also be leveraged to perform another type of analysis: run-time scalability. Performance is often measured by elapsed time, e.g. how long it takes to perform a certain operation. This stopwatch approach only tells half of the story. For example, measuring that an address book application sorts 10 contacts in 10 ms says nothing about how that sort scales. How will it cope with 100 contacts? 1,000 contacts? Since it is not always practical to carry out a formal analysis of the application code to determine its complexity, the workaround is to measure the empirical run-time complexity. In this example, that can be done by instrumenting a particular part of the sorting implementation (most likely the "swap two entries" function) and watching its behavior with different input sizes.
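As a rough sketch of that idea (the sort and the counter here are illustrative, not taken from any particular tool or talk), count the swaps performed by a simple insertion sort and watch how the count grows with the input size:

```javascript
// Minimal sketch of empirical run-time complexity: instrument the
// "swap two entries" function and observe it across input sizes.
let swapCount = 0;

function swap(arr, i, j) {
  swapCount += 1;                 // instrumentation point
  const tmp = arr[i];
  arr[i] = arr[j];
  arr[j] = tmp;
}

function insertionSort(arr) {
  for (let i = 1; i < arr.length; i++) {
    for (let j = i; j > 0 && arr[j - 1] > arr[j]; j--) {
      swap(arr, j, j - 1);
    }
  }
  return arr;
}

// The swap count grows roughly quadratically with the input size,
// revealing the empirical complexity without any formal analysis.
[10, 100, 1000].forEach(function (n) {
  swapCount = 0;
  const contacts = Array.from({ length: n }, () => Math.random());
  insertionSort(contacts);
  console.log(n + ' items -> ' + swapCount + ' swaps');
});
```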

As JavaScript applications get more and more complex, some steps are necessary to keep the code as readable and understandable as possible. With a tool like JSComplexity, code complexity metrics can be obtained through static analysis. Even better, you can track both McCabe's cyclomatic complexity and the Halstead complexity measures of every function over time. This helps catch accidental code changes that add more complexity to the code. For the application dashboard or continuous integration panel, these complexity metrics can be visualized using Plato in a few easy steps.
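For a rough intuition about what cyclomatic complexity captures (the actual computation is left to the tools above), consider this made-up function, where each independent decision point adds another path through the code:

```javascript
// Hypothetical example: cyclomatic complexity counts independent paths.
// The baseline is 1; each if/else-if adds 1, and many tools also count
// short-circuit operators such as && and ||.
function classifyContact(contact) {
  if (!contact) {                           // +1
    return 'invalid';
  }
  if (contact.email && contact.phone) {     // +1 for the if, +1 if && is counted
    return 'complete';
  } else if (contact.email) {               // +1
    return 'email-only';
  }
  return 'incomplete';
}
// A static-analysis tool such as JSComplexity would report a cyclomatic
// complexity of about 4 or 5 here, depending on how it treats &&.
```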


Kate Matsudaira: If You Don’t Understand People, You Don’t Understand Ops

Velocity 2013 Speaker Series

While automation is clearly making life much better for everyone who works in Operations, startup founder Kate Matsudaira (@katemats) acknowledges that "No one ever does their work in a vacuum." You can try as much as possible to Automate All The Things, but you can't automate trust. And trust is key to a healthy, thriving operations team (and to your own professional growth, too).

In this interview, Kate discusses some of the things she’ll be talking about at Velocity next month. Key highlights include:

  • The word “people” is pretty broad. What aspects of working with people should operations teams care about? [Discussed at 1:32]
  • Ultimately, you depend on the people around you to help get work done, especially when you need to get funding, be it externally for a startup, or internally for an infrastructure or refactoring project. The more people trust you, the more likely that is to happen. [Discussed at 3:17]
  • Cultural change takes leadership, but that leadership doesn’t have to come from the top. [Discussed at 5:00]
  • You can be ridiculously technically competent, but if you can’t communicate well, it hinders your success in the long run. [Discussed at 5:40]

You can view the entire interview here:

This is one of a series of posts related to the upcoming Velocity Conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.

 

Sascha Bates on Configuration Management: It’s Not about the Tool

Velocity 2013 Speaker Series

“Puppet and Chef are completely different, and yet exactly the same,” admits Sascha Bates (@sascha_d). In this interview about her talk at the upcoming Velocity Conference, she discusses common pitfalls that people can avoid when getting started with configuration management. And here’s a hint: it isn’t about which tool you choose.

After years in the trenches helping a variety of organizations implement Chef, Sascha learned (often the hard way) a few critical things that she’ll share in her talk. Key points from our discussion include:

  • When getting started with configuration management, people often fret over which tool they use, when they should be thinking more about the overall integration with their particular system. [Discussed at the 0:50 mark.]
  • Both Chef and Puppet have the concept of a package manager, and if you’re not setting that up properly, things can spiral out of control quickly. [Discussed at the 1:25 mark.]
  • Her top configuration management anti-patterns. [Discussed at the 2:43 mark.]
  • What superpower someone will have after attending her talk. [Discussed at the 3:55 mark.]

Watch the whole interview here:

This is the first in a series of posts related to the upcoming Velocity Conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.

 

Velocity Culture: Web Operations, DevOps, etc…

Velocity 2010 is happening on June 22-24 (right around the corner!). This year we’ve added a third track, Velocity Culture, dedicated to exploring what we’ve learned about how great teams and organizations work together to succeed at scale.

Web Operations, or WebOps, is what many of us have been calling these ideas for years. Recently the term “DevOps” has become a kind of rallying cry that is resonating with many, along with variations on Agile Operations.

Velocity 2010: Fast By Default

We’re entering our third year of Velocity, the Web Performance & Operations Conference. Velocity 2010 will be June 22-24, 2010 in Santa Clara, CA. It’s going to be another incredible year. Steve Souders & I have set a new theme this year, “Fast by Default”. We want the broader Velocity community to adopt it as a shared mission & mantra. The reason for this is simple.

More on how web performance impacts revenue…

At Velocity this year, Microsoft, Google, and Shopzilla each presented data on how web performance directly impacts revenue. Their data showed that slow sites get fewer search queries per user, less revenue per visitor, fewer clicks, fewer searches, and lower search engine rankings. They also found that, in some cases, users continued to interact with a site as if it were slow even after its performance had been improved. Bad experiences have a lasting influence on customer behavior.

John Adams on Fixing Twitter: Improving the Performance and Scalability of the World's Most Popular Micro-blogging Site

Twitter is suffering outages today as they fend off a Denial of Service attack, and so I thought it would be helpful to post John Adams’ exceptional Velocity session about Operations at Twitter. Good luck today John & team… I know it’s going to be a long day! Update: Apparently Facebook & Livejournal have had similar attacks today. Rich Miller…

Velocity and the Bottom Line

Velocity 2009 took place last week in San Jose, with Jesse Robbins and me serving as co-chairs. Back in November 2008, while we were planning Velocity, I said I wanted to highlight “best practices in performance and operations that improve the user experience as well as the company’s bottom line.” Much of my work focuses on the how of improving performance – tips developers use to create even faster web sites. What’s been missing is the why. Why is it important for companies to focus on performance?