"operations" entries

Exploring lightweight monitoring systems

Toward unifying customer behavior and operations metrics.

lightweight_systemsFor the last ten years I’ve had a foot in both the development and operations worlds. I stumbled into the world of IT operations as a result of having the most UNIX skills in the team shortly after starting at ThoughtWorks. I was fortunate enough to do so at a time when many of my ThoughtWorks colleagues and I where working on the ideas which were captured so well in Jez Humble and Dave Farley’s Continuous Delivery (Addison-Wesley).

During this time, our focus was on getting our application into production as quickly as possible. We were butting up against the limits of infrastructure automation and IaaS providers like Amazon were only in their earliest form.

Recently, I have spent time with operations teams who are most concerned with the longer-term challenges of looking after increasingly complex ecosystems of systems. Here the focus is on immediate feedback and knowing if they need to take action. At a certain scale, complex IT ecosystems can seem to exhibit emergent behavior, like an organism. The operations world has evolved a series of tools which allow these teams to see what’s happening *right now* so we can react, keep things running, and keep people happy.

At the same time, those of us who spend time thinking about how to quickly and effectively release our applications have become preoccupied with wanting to know if that software does what our customers want once it gets released. The Lean Startup movement has shown us the importance of putting our software in front of our customers, then working out how they actually use it so we can determine what to do next. In this world, I was struck by the shortcomings of the tools in this space. Commonly used web analytics tools, for example, might only help me understand tomorrow how my customers used my site today.

Read more…

Making Systems Operable

Velocity 2013 Speaker Series

There’s an old joke about the aviation cockpit of the future that it will contain just a pilot and a dog. The pilot will be there to watch the automation. The dog will be there to bite the pilot if he tries to touch anything.

Although they will all deny it, the majority of modern IT developers have exactly this view of automation: the system is designed to be self regulating and operators are there to watch it, not to operate it. The result is current systems are often inoperable, i.e. systems they cannot be effectively operated because their functions and capacities are hidden or inaccessible.

The conceit in the pilot-and-the-dog joke is that modern systems do not require operation, that they are autonomous. Whenever these systems are exhibited, our attention is drawn to their autonomous features. But there are no systems that actually function without operators. Even when we claim they are “unmanned”, all important systems have operators who are intimately involved in their function: UAV’s are piloted, the Mars rover is driven, the satellites are managed, surgical robots are manipulated, insulin pumps are programmed. We do not see these activities–many are performed by workers who remain anonymous–but we depend on them.

Read more…

Efficient, Effective Communication Still Often Elusive

In the operational environment, miscommunication can be costly; but there are some easy ways to improve it.

Editor’s note: This is part two in a four-part series on the “-ations” of aviation that can provide further insight into DevOps best practices and achieving them. Part one, on how standardization helps organizations scale and is actually a part of healthy DevOps culture, can be read here.

Communication is an enigmatic topic when it comes to engineering. Parts of our jobs—blueprints, chemical formulae, and source code—require extremely precise forms of communication (even if it doesn’t end up communicating to the steel, molecules, or silicon what we intended). But when it comes to email threads sifting through requirements, meetings about implementation styles and risk assessment, and software design documentation, we often fumble.

Let’s face it: there’s a reason the “engineer equals bad communicator” stereotype exists. But there are some simple things that can be done, both individually and technologically, to begin challenging that stereotype.

Dual Navigation Receivers Required

There are obviously many forms of communication. In an operational context, it’s useful to distinguish between static and active communication.

Read more…

Surfacing anomalies and patterns in Machine Data

Compelling large-scale data platforms originate from the world of IT Operations

I’ve been noticing that many interesting big data systems are coming out of IT operations. These are systems that go beyond the standard “capture/measure, display charts, and send alerts”. IT operations has long been a source of many interesting big data1 problems and I love that it’s beginning to attract the attention2 of many more data scientists and data engineers.

It’s not surprising that many of the interesting large-scale systems that target time-series and event data have come from ops teams: in an earlier post on time-series, several of the tools I highlighted came out of IT operations. IT operations involves monitoring many different hardware and software systems, a task that requires a variety of tools and which quickly leads to “metrics overload”. A partial list includes data captured from a wide range of application log files, network traffic, energy and power sources.

The volume of IT ops data has led to new tools like OpenTSDB and KairosDB – time series databases that leverage HBase and Cassandra. But storage, simple charts, and lookups are just the foundation of what’s needed. IT Ops track many interdependent systems, some of which might be correlated3. Not only are IT ops faced with highlighting “unknown unknowns” in their massive data sets, they often need to do so in near realtime.

Read more…

Zero Downtime Application Updates with Ansible

OSCON 2013 Speaker Series

Automating the configuration management of your operating systems and the rollout of your applications is one of the most important things an administrator or developer can do to avoid surprises when updating services, scaling up, or recovering from failures. However, it’s often not enough. Some of the most common operations that happen in your datacenter (or cloud environment) involve large numbers of machines working together and humans to mediate those processes. While we have been able to remove a lot of human effort from configuration, there has been a lack of software able to handle these higher-level operations.

I used to work for a hosted web application company where the IT process for executing an application update involved locking six people in a room for sometimes 3-4 hours, each person pressing the right buttons at the right time. This process almost always had a glitch somewhere where someone forgot to run the right command or something wasn’t well tested beforehand. While some technical solutions were applied to handle configuration automation, nothing that could perform configuration could really accomplish that high level choreography on top as well. This is why I wrote Ansible.

Ansible is a configuration management, application deployment, and IT orchestration system. One of Ansible’s strong points is having a very simple, human readable language – it allows users very fine, precise control over what happens on what machines at what times.

Getting started

To get started, create an inventory file, for instance, ~/ansible_hosts that defines what machines you are managing, and which machines are frequently organized into groups. Ansible can also pull inventory from multiple cloud sources, but an inventory file is a quick way to get started:

[webservers]
www01.example.com
www02.example.com
# add more webservers here

[monitoring]
nagios1.example.com

[lbservers]
haproxy1.example.com
haproxy2.example.com

Now that you have defined what machines you are managing, you have to define what you are going to do on the remote machines.

Ansible calls this description of processes a “playbook,” and you don’t have to have just one, you could have different playbooks for different kinds of tasks.

Let’s look at an example for describing a rolling update process. This example is somewhat involved because it’s using haproxy, but haproxy is freely available. Ansible also includes modules for dealing with Netscalers and F5 load balancers, so this is just an example — ordinarily you would start more simply and work up to an example like this:
Read more…

Ops Mythology

Velocity 2013 Speaker Series

At some point, we’ve all ended up trading horror stories over drinks with colleagues. Heads nod and shake in sympathy, and the stories get hairier as the night goes on. And while it of course feels good to get some of that dirt off your shoulder, is there a larger, better purpose to sharing war stories? I sat down with James Turnbull of Puppet Labs (@kartar) to chat about his upcoming Velocity talk about Ops mythology, and how we might be able to turn our tales of disaster into triumph.

Key highlights of our discussion include:

  • Why do we share disaster stories? What is the attraction? [Discussed at 0:40]
  • Stories are about shared experience and bonding with members of our community. [Discussed at 2:10]
  • These horror stories are like mythological “big warnings” that help enforce social order, which isn’t always a good thing. [Discussed at 4:18]
  • A preview of how his talk will be about moving away from the bad stories so people can keep telling more good stories. (Also: s’mores.) [Discussed at 7:15]

You can watch the entire interview here:

This is one of a series of posts related to the upcoming Velocity conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.

Kate Matsudaira: If You Don’t Understand People, You Don’t Understand Ops

Velocity 2013 Speaker Series

While automation is clearly making everyone’s lives who work in Operations much better, startup founder Kate Matsudaira (@katemats) acknowledges that “No one ever does their work in a vaccum.” You can try as much as possible to Automate All The Things, but you can’t automate trust. And trust is key to a healthy, thriving operations team (and your own professional growth, too).

In this interview, Kate discusses some of the things she’ll be talking about at Velocity next month. Key highlights include:

  •  The word “people” is pretty broad. What aspects of working with people should operations teams care about? [Discussed at 1:32]
  • Ultimately, you depend on the people around you to help get work done, especially when you need to get funding, be it externally for a startup, or internally for an infrastructure or refactoring project. The more people trust you, the more likely that is to happen. [Discussed at 3:17]
  • Cultural change takes leadership, but that leadership doesn’t have to come from the top. [Discussed at 5:00]
  • You can be ridiculously technically competent, but if you can’t communicate well, it hinders your success in the long run. [Discussed at 5:40]

You can view the entire interview here:

This is one of a series of posts related to the upcoming Velocity Conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.

 

Velocity Profile: Schlomo Schapiro

Web ops and performance questions with Schlomo Schapiro.

A profile of web operations and performance expert Schlomo Schapiro, systems architect and open source evangelist at ImmobilienScout24.

What is DevOps?

What we mean by "operations," and how it's changed over the years.

NoOps, DevOps — no matter what you call it, operations won’t go away. Ops experts and development teams will jointly evolve to meet the challenges of delivering reliable software to customers.

Velocity Profile: Kate Matsudaira

Web ops and performance questions with Kate Matsudaira.

A profile of web operations and performance expert Kate Matsudaira, vice president of engineering at Decide.com.