"operations" entries

Applied DevOps and the potential of Docker

The cultural impact within a software engineering organization can be dramatic.

Editor’s note: this post is from Karl Matthias and Sean P. Kane, authors of “Docker Up & Running,” a guide to quickly learn how to use Docker to create packaged images for easy management, testing, and deployment of software.

At the Python Developers Conference in Santa Clara, California, on March 15th, 2013, with no pre-announcement and little fanfare, Solomon Hykes, the founder and CEO of dotCloud, gave a 5-minute lightning talk where he first introduced the world to a brand new tool for Linux called Docker. It was a response to the hardships of shipping software at scale in a fast-paced world, and takes an approach that makes it easy to map organizational processes to the principles of DevOps.

The capabilities of the typical software engineering company have often not kept pace with the quickly evolving expectations of the average technology user. Users today expect fast, reliable systems with continuous improvements, ease of use, and broad integrations. Many in the industry see the principles of DevOps as a giant leap toward building organizations that meet the challenges of delivering high quality software in today’s market. Docker is aimed at these challenges.

Read more…

Comment: 1

How we got to the HTTP/2 and HPACK RFCs

A brief history of SPDY and HTTP/2.

Download HTTP/2.

Download HTTP/2.

Download a free copy of HTTP/2. This post is an excerpt by Ilya Grigorik from High Performance Browser Networking, the essential guide to networking and web performance for web developers.

SPDY was an experimental protocol, developed at Google and announced in mid-2009, whose primary goal was to try to reduce the load latency of web pages by addressing some of the well-known performance limitations of HTTP/1.1. Specifically, the outlined project goals were set as follows:

  • Target a 50% reduction in page load time (PLT).
  • Avoid the need for any changes to content by website authors.
  • Minimize deployment complexity, avoid changes in network infrastructure.
  • Develop this new protocol in partnership with the open-source community.
  • Gather real performance data to (in)validate the experimental protocol.

To achieve the 50% PLT improvement, SPDY aimed to make more efficient use of the underlying TCP connection by introducing a new binary framing layer to enable request and response multiplexing, prioritization, and header compression.

Not long after the initial announcement, Mike Belshe and Roberto Peon, both software engineers at Google, shared their first results, documentation, and source code for the experimental implementation of the new SPDY protocol:

So far we have only tested SPDY in lab conditions. The initial results are very encouraging: when we download the top 25 websites over simulated home network connections, we see a significant improvement in performance—pages loaded up to 55% faster.

— A 2x Faster Web Chromium Blog

Fast-forward to 2012 and the new experimental protocol was supported in Chrome, Firefox, and Opera, and a rapidly growing number of sites, both large (e.g. Google, Twitter, Facebook) and small, were deploying SPDY within their infrastructure. In effect, SPDY was on track to become a de facto standard through growing industry adoption.

Read more…


Deploy Continuous Improvement

Balancing the work it takes to improve capability against delivery work that provides value to customers.

Download a free copy of Building an Optimized Business, a curated collection of chapters from the O’Reilly Web Operations and Performance library. This post is an excerpt by Jez Humble, Joanne Molesky, and Barry O’Reilly from Lean Enterprise, one of the selections included in the curated collection.

In most enterprises, there is a distinction between the people who build and run software systems (often referred to as “IT”) and those who decide what the software should do and make the investment decisions (often called “the business”). These names are relics of a bygone age in which IT was considered a cost necessary to improve efficiencies of the business, not a creator of value for external customers by building products and services. These names and the functional separation have stuck in many organizations (as has the relationship between them, and the mindset that often goes with the relationship). Ultimately, we aim to remove this distinction. In high-performance organizations today, people who design, build, and run software-based products are an integral part of business; they are given — and accept — responsibility for customer outcomes. But getting to this state is hard, and it’s all too easy to slip back into the old ways of doing things.

Read more…


How to leverage the browser cache with a CDN

An introduction to multi-level caching.


Since a content delivery network (CDN) is essentially a cache, you might be tempted not to make use of the cache in the browser, to avoid complexity. However, each cache has its own advantages that the other does not provide. In this post I will explain what the advantages of each are, and how to combine the two for the most optimal performance of your website.

Why use both?

While CDNs do a good job of delivering assets very quickly, they can’t do much about users who are out in the boonies and barely have a single bar of reception on their phone. As a matter of fact, in the US, the 95th percentile for the round trip time (RTT) to all CDNs is well in excess of 200 milliseconds, according to Cedexis reports. That means at least 5% of your users, if not more, are likely to have a slow experience with your website or application. For reference, the 50th percentile, or median, RTT is around 45 milliseconds.

So why bother using a CDN at all? Why not just rely on the browser cache?

Read more…


What is DevOps (yet again)?

Empathy, communication, and collaboration across organizational boundaries.

Cropped image "Kilobot robot swarm" by asuscreative - Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons.http://commons.wikimedia.org/wiki/File:Kilobot_robot_swarm.JPG#mediaviewer/File:Kilobot_robot_swarm.JPG

I might try to define DevOps as the movement that doesn’t want to be defined. Or as the movement that wants to evade the inevitable cargo-culting that goes with most technical movements. Or the non-movement that’s resisting becoming a movement. I’ve written enough about “what is DevOps” that I should probably be given an honorary doctorate in DevOps Studies.

Baron Schwartz (among others) thinks it’s high time to have a definition, and that only a definition will save DevOps from an identity crisis. Without a definition, it’s subject to the whims of individual interest groups, and ultimately might become a movement that’s defined by nothing more than the desire to “not be like them.” Dave Zwieback (among others) says that the lack of a definition is more of a blessing than a curse, because it “continues to be an open conversation about making our organizations better.” Both have good points. Is it possible to frame DevOps in a way that preserves the openness of the conversation, while giving it some definition? I think so.

DevOps started as an attempt to think long and hard about the realities of running a modern web site, a problem that has only gotten more difficult over the years. How do we build and maintain critical sites that are increasingly complex, have stringent requirements for performance and uptime, and support thousands or millions of users? How do we avoid the “throw it over the wall” mentality, in which an operations team gets the fallout of the development teams’ bugs? How do we involve developers in maintenance without compromising their ability to release new software?

Read more…

Comments: 2

If it weren’t for the people…

A humanist approach to automation.

load_of_bricksEditor’s note: At some point, we’ve all read the accounts in newspapers or on blogs that “human error” was responsible for a Twitter outage, or worse, a horrible accident. Automation is often hailed as the heroic answer, poised to eliminate the specter of human error. This guest post from Steven Shorrock, who will be delivering a keynote speech at Velocity in Barcelona, exposes human error as dangerous shorthand. The more nuanced way through involves systems thinking, marrying the complex fabric of humans and the machines we work with every day.

In Kurt Vonnegut’s dystopian novel ‘Player Piano’, automation has replaced most human labour. Anything that can be automated, is automated. Ordinary people have been robbed of their work, and with it purpose, meaning and satisfaction, leaving the managers, scientists and engineers to run the show. Dr Paul Proteus is a top manager-engineer at the head of the Ilium Works. But Proteus, aware of the unfairness of the situation for the people on the other side of the river, becomes disillusioned with society and has a moral awakening. In the penultimate chapter, Paul and his best friend Finnerty, a brilliant young engineer turned rogue-rebel, reminisce sardonically: “If only it weren’t for the people, the goddamned people,” said Finnerty, “always getting tangled up in the machinery. If it weren’t for them, earth would be an engineer’s paradise.

Read more…


Exploring lightweight monitoring systems

Toward unifying customer behavior and operations metrics.

lightweight_systemsFor the last ten years I’ve had a foot in both the development and operations worlds. I stumbled into the world of IT operations as a result of having the most UNIX skills in the team shortly after starting at ThoughtWorks. I was fortunate enough to do so at a time when many of my ThoughtWorks colleagues and I where working on the ideas which were captured so well in Jez Humble and Dave Farley’s Continuous Delivery (Addison-Wesley).

During this time, our focus was on getting our application into production as quickly as possible. We were butting up against the limits of infrastructure automation and IaaS providers like Amazon were only in their earliest form.

Recently, I have spent time with operations teams who are most concerned with the longer-term challenges of looking after increasingly complex ecosystems of systems. Here the focus is on immediate feedback and knowing if they need to take action. At a certain scale, complex IT ecosystems can seem to exhibit emergent behavior, like an organism. The operations world has evolved a series of tools which allow these teams to see what’s happening *right now* so we can react, keep things running, and keep people happy.

At the same time, those of us who spend time thinking about how to quickly and effectively release our applications have become preoccupied with wanting to know if that software does what our customers want once it gets released. The Lean Startup movement has shown us the importance of putting our software in front of our customers, then working out how they actually use it so we can determine what to do next. In this world, I was struck by the shortcomings of the tools in this space. Commonly used web analytics tools, for example, might only help me understand tomorrow how my customers used my site today.

Read more…


Making Systems Operable

Velocity 2013 Speaker Series

There’s an old joke about the aviation cockpit of the future that it will contain just a pilot and a dog. The pilot will be there to watch the automation. The dog will be there to bite the pilot if he tries to touch anything.

Although they will all deny it, the majority of modern IT developers have exactly this view of automation: the system is designed to be self regulating and operators are there to watch it, not to operate it. The result is current systems are often inoperable, i.e. systems they cannot be effectively operated because their functions and capacities are hidden or inaccessible.

The conceit in the pilot-and-the-dog joke is that modern systems do not require operation, that they are autonomous. Whenever these systems are exhibited, our attention is drawn to their autonomous features. But there are no systems that actually function without operators. Even when we claim they are “unmanned”, all important systems have operators who are intimately involved in their function: UAV’s are piloted, the Mars rover is driven, the satellites are managed, surgical robots are manipulated, insulin pumps are programmed. We do not see these activities–many are performed by workers who remain anonymous–but we depend on them.

Read more…

Comments: 2

Efficient, Effective Communication Still Often Elusive

In the operational environment, miscommunication can be costly; but there are some easy ways to improve it.

Editor’s note: This is part two in a four-part series on the “-ations” of aviation that can provide further insight into DevOps best practices and achieving them. Part one, on how standardization helps organizations scale and is actually a part of healthy DevOps culture, can be read here.

Communication is an enigmatic topic when it comes to engineering. Parts of our jobs—blueprints, chemical formulae, and source code—require extremely precise forms of communication (even if it doesn’t end up communicating to the steel, molecules, or silicon what we intended). But when it comes to email threads sifting through requirements, meetings about implementation styles and risk assessment, and software design documentation, we often fumble.

Let’s face it: there’s a reason the “engineer equals bad communicator” stereotype exists. But there are some simple things that can be done, both individually and technologically, to begin challenging that stereotype.

Dual Navigation Receivers Required

There are obviously many forms of communication. In an operational context, it’s useful to distinguish between static and active communication.

Read more…


Surfacing anomalies and patterns in Machine Data

Compelling large-scale data platforms originate from the world of IT Operations

I’ve been noticing that many interesting big data systems are coming out of IT operations. These are systems that go beyond the standard “capture/measure, display charts, and send alerts”. IT operations has long been a source of many interesting big data1 problems and I love that it’s beginning to attract the attention2 of many more data scientists and data engineers.

It’s not surprising that many of the interesting large-scale systems that target time-series and event data have come from ops teams: in an earlier post on time-series, several of the tools I highlighted came out of IT operations. IT operations involves monitoring many different hardware and software systems, a task that requires a variety of tools and which quickly leads to “metrics overload”. A partial list includes data captured from a wide range of application log files, network traffic, energy and power sources.

The volume of IT ops data has led to new tools like OpenTSDB and KairosDB – time series databases that leverage HBase and Cassandra. But storage, simple charts, and lookups are just the foundation of what’s needed. IT Ops track many interdependent systems, some of which might be correlated3. Not only are IT ops faced with highlighting “unknown unknowns” in their massive data sets, they often need to do so in near realtime.

Read more…