Beyond the stack

The tools in the Distributed Developer's Stack make development manageable in a highly distributed environment.

The shape of software development has changed radically in the last two decades. We’ve seen many changes: the Internet, the web, virtualization, and cloud computing. All of these changes point toward a fundamental new reality: all computing has become distributed computing. The age of standalone applications is over, and applications that run on a single computer are almost inconceivable. Distributed is the default; whether an application is running on Amazon Web Services (AWS), on a private cloud, or even on a desktop or a mobile phone, it depends on the behavior of other systems and services that aren’t under the developer’s control.

In the past few years, a new toolset has grown up to support the development of massively distributed applications. We call this new toolset the Distributed Developer’s Stack (DDS). It is orthogonal to the more traditional world of servers, frameworks, and operating systems; it isn’t a replacement for the aged LAMP stack, but a set of tools to make development manageable in a highly distributed environment.

The DDS is more of a meta-stack than a “stack” in the traditional sense. It’s not prescriptive; we don’t care whether you use AWS or OpenStack, whether you use Git or Mercurial. We do care that you develop for the cloud, and that you use a distributed version control system. The DDS is about the requirements for working effectively in the second decade of the 21st century. The specific tools have evolved, and will continue to evolve, and we expect you to evolve, too.

Cloud as platform

AWS has revolutionized software development. It’s simple for a startup to allocate as many servers as it needs, tailored to its requirements, at low cost. A developer at an established company can short-circuit traditional IT procurement channels, and assemble a server farm in minutes using nothing more than a credit card.

Even applications that don’t use AWS or some other cloud implementation are distributed. The simplest web page requires a server, a web browser to view it, DNS servers for hostname resolution, and any number of switches and routers to move bits from one place to another. A web application that’s only slightly more complex relies on authentication servers, databases, and other web services for real-time data. All these are externalities that make even the simplest application into a distributed system. A power outage, router failure, or even a bad cable in a city you’ve never heard of, can take your application down.

I’m not arguing that the sky is falling because … cloud. But it is critically important to understand what the cloud means for the systems we deploy and operate. As the number of systems involved in an application grows, the number of failure modes grows combinatorially. An application running over 10 servers isn’t 10 times as complex as an application running on a single server; it’s thousands of times more complex.
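That combinatorial growth is easy to make concrete: if each of 10 servers can independently be up or down, the system as a whole has 2^10 = 1,024 possible states, all but one of which involve at least one failure (the server count here is illustrative):

```shell
# Each of 10 servers can be up or down independently, so the system as a
# whole has 2^10 possible states; only one of them is "everything works."
echo $(( (1 << 10) - 1 ))   # 1023 distinct failure combinations
```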

The cloud is with us to stay. Whether it’s public or private, AWS, OpenStack, Microsoft Azure, or Google Compute Engine, applications will run in the cloud for the foreseeable future. We have to deal with it.

Development as a distributed process

We’ve made many advances in source control over the years, but until recently we’ve never dealt with the fact that software development itself is distributed. Our models have been based on the idea of lone “programmers” writing monolithic “programs” that run on isolated “machines.” We have had build tools, source control archives, and other tools to make the process easier, but none of these tools really recognize that projects require teams. Developers would work on their part of the project, then try to resolve the mess in a massive “integration” stage in which all the separate pieces are assembled.

The version control system Git recognizes that a team of developers is fundamentally a distributed system, and that the natural process of software development is to create branches, or forks, then merge those branches back into a master repository. All developers have their own local codebase, branching from master. When they’re ready, they merge their changes and push them back to master; at this point, other members of the team can pull the changes to update their own codebases. Each developer’s work is decoupled from the others; team members can work asynchronously, distributed in time as well as in space.
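As a sketch of that cycle, using a throwaway local repository (the repository, branch, and file names are illustrative, and a real team would merge via a shared remote rather than locally):

```shell
# Throwaway local repository demonstrating the branch-and-merge cycle.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"   # identity for the example commits
git config user.name "Example Dev"
default=$(git symbolic-ref --short HEAD)  # master or main, depending on git version

echo "core app" > app.txt
git add app.txt
git commit -qm "initial commit"

# each developer works on a branch, decoupled from everyone else
git checkout -qb feature-login
echo "login handler" >> app.txt
git commit -qam "add login handler"

# when the work is ready, it is merged back; teammates then pull the result
git checkout -q "$default"
git merge -q feature-login
cat app.txt
```

In a real team, the merge lands on a shared remote via `git push`, and the other developers pick it up with `git pull`.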

Continuous integration tools like Jenkins and its predecessor, Hudson, were among the first tools to recognize the paradigm shift. Continuous integration reflects the reality that, when development is distributed, integrating the work of all the developers has to be a constant process. It can’t be postponed until a major release is finished. It’s important to move forward in small, incremental steps, making sure that the project always builds and works.

Facilitating collaboration on a team of distributed developers will never be a simple problem. But it’s a problem that becomes much more tractable with tools that recognize the nature of distributed development, rather than trying to maintain the myth of the solitary programmer.

Infrastructure as code

Infrastructure as code has been a slogan at the Velocity Conference for some years now. But what does that mean?

Cloud computing lets developers allocate servers as easily as they allocate memory. But as any ’90s sysadmin knows, the tough part isn’t taking the server out of the box, it’s getting it set up and configured correctly. And that’s a pain whether you’re sitting at a console terminal with a green screen or ssh’ed into a virtual box a thousand miles away. It’s an even bigger pain when you’ve grown from a single server or a small cluster to hundreds of AWS nodes distributed around the world.

In the last decade, we’ve seen a proliferation of tools to solve this problem. Chef, Puppet, CFEngine, Ansible, and other tools capture system configurations in scripts, automating the configuration of computer systems, whether physical or virtual. The ability to allocate machines dynamically and configure them automatically changes our relationship to computing resources. In the old days, when something went wrong, a sysadmin had to nurse the system back to health, whether by rebooting, reinstalling software, replacing a disk drive, or something else. When something was broken, you had to fix it. That still may be true of our laptops or phones, but it’s no longer true of our production infrastructure. If something goes wrong with a server on AWS, you delete it, and start another one. It’s easier, simpler, quicker, cheaper. A small operations staff can manage thousands, or tens of thousands, of servers. With the appropriate monitoring tools, it’s even possible to automate the process of identifying a malfunctioning server, stopping it, deleting it, and allocating a new one.
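The core idea these tools share is idempotent convergence: a run describes the desired state and only acts when reality differs, so running it twice is safe. A minimal sketch of that idea in shell (the config file and setting are illustrative; tools like Chef and Ansible do this declaratively, at far larger scale):

```shell
# Converge a config file toward a desired state: act only if reality differs.
conf=$(mktemp)   # stand-in for something like /etc/myapp/app.conf

ensure_line() {
  # append the line only when it is missing, so repeated runs are harmless
  grep -qxF "$1" "$2" || echo "$1" >> "$2"
}

ensure_line "max_connections = 512" "$conf"
ensure_line "max_connections = 512" "$conf"   # second run: no change
cat "$conf"
```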

If configuration is code, then configuration must be considered part of the software development process. It’s not enough to develop software on your laptop, and expect operations staff to build systems on which to deploy. Development and deployment aren’t separate processes; they’re two sides of the same thing.

Containerization as deployment

Containers are the most recent addition to the stack. Containers go a step beyond virtualization: a system like Docker lets you build a package that is exactly what you need to deploy your software: no more, and no less. This package is analogous to the standard shipping container that revolutionized transportation several decades ago. Rather than carefully loading a transport ship with pianos, nuts, barrels of oil, and what have you, these things are stacked into standard containers that are guaranteed to fit together, that can be loaded and unloaded easily, placed not only onto the ship but also onto trucks and trains, and never opened until they reach their destination.

Containers are special because they always run the same way. You can package your application in a Docker container and run it on your laptop; you can ship it to Amazon and run it on an AWS instance; you can ship it to a private OpenStack cloud and run it there; you can even run it on a server in your machine room, if you still have one. The container has everything needed to run the code correctly. You don’t have to worry about someone upgrading the operating system, installing a new version of Apache or nginx, replacing a library with a “better” version, or any number of things that can result in unpleasant surprises. Of course, you’re now responsible for keeping your containers patched with the latest operating systems and libraries; you can’t rely on the sysadmins. But you’re in control of the process: your software will always run in exactly the environment you specify. And given the many ways software can fail in a distributed environment, eliminating one source of failure is a good thing.
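As a minimal sketch, a Dockerfile captures that exact environment; the base image and paths here are illustrative:

```dockerfile
# Pin the base image so nobody upgrades the OS or web server underneath you
FROM nginx:1.25
# Copy in exactly what the application needs: no more, no less
COPY site/ /usr/share/nginx/html/
EXPOSE 80
```

Built once with `docker build`, the resulting image runs identically on a laptop, an AWS instance, or a private OpenStack cloud.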

Monitoring as testing

In a massively distributed system, software can fail in many ways that you can’t test for. Test-driven development won’t tell you how your applications will respond when a router fails. No acceptance test will tell you how your application will perform under a load that’s 1,000 times the maximum you expected. Testing may occasionally flush out a race condition that you hadn’t noticed, but that’s the exception rather than the rule.

Netflix’ Chaos Monkey shows how radical the problem is. Because systematic testing can never find all the problems in a distributed system, Netflix resorts to random vandalism. Chaos Monkey (along with other members of Netflix’ Simian Army) periodically terminates random services in Netflix’ AWS cloud, potentially causing failures in their production systems. These failures mostly go unnoticed, because Netflix developers have learned to build systems that are robust and resilient in the face of failure. But on occasion, Chaos Monkey reveals a problem that probably couldn’t have been discovered through any other means.
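The mechanism itself is simple; the hard part is building systems that survive it. A hypothetical sketch of the core loop (the instance ids are made up, and the destructive call is left commented out; the real Simian Army adds opt-in, scheduling, and safety checks):

```shell
# Pick one instance at random from a group and terminate it.
instances="i-0aaa111 i-0bbb222 i-0ccc333"   # stand-in for an auto-scaling group
victim=$(printf '%s\n' $instances | shuf -n 1)
echo "chaos monkey strikes: $victim"
# aws ec2 terminate-instances --instance-ids "$victim"   # the destructive step
```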

Monitoring is the next step beyond testing; it’s really continuous run-time testing for distributed systems, where comprehensive pre-deployment testing is impossible. Monitoring tools such as Riemann, statsd, and Graphite tell you how your systems are handling real-world conditions. They’re the tools that let you know if a router has failed, if your servers have died, or if they’re not holding up under an unexpected load. Back in the ’60s and ’70s, computers periodically “crashed,” and system administrators would scurry around figuring out what happened and getting them rebooted. We no longer have the luxury of waiting for failures to happen, then guessing about what went wrong. Monitoring tools enable us to see problems coming, and when necessary, to analyze what happened after the fact.
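Instrumenting code for these tools is deliberately cheap. statsd, for instance, accepts plain-text metrics over UDP, one per datagram; the metric names and values below are illustrative:

```shell
# statsd wire format: <name>:<value>|<type>, sent fire-and-forget over UDP
counter="app.logins:1|c"       # increment a counter
timer="app.latency:320|ms"     # record a timing, in milliseconds
printf '%s\n' "$counter" "$timer"
# a client sends each line as a UDP datagram to the statsd daemon (port 8125
# by default), e.g. via a client library or bash's /dev/udp
```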

Monitoring also lets the developer understand what features are being used, and which are not, and applications that are deployed as cloud services lend themselves easily to A/B testing. Rather than designing a monolithic piece of software, you start with what Eric Ries calls a minimum viable product—the smallest possible product that will give you validated learning about what the customer really wants and responds to—and then build out from there. You start with a hypothesis about user needs, and constantly measure and learn how better to meet those needs. Software design itself becomes iterative.
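A/B testing depends on assigning each user to a variant deterministically, so the same user always sees the same experience. A sketch of that assignment (the checksum and even split are illustrative; real systems use better hashes and weighted buckets):

```shell
# Deterministically map a stable user id to variant A or B.
variant() {
  n=$(printf '%s' "$1" | cksum | cut -d' ' -f1)   # stable checksum of the id
  if [ $((n % 2)) -eq 0 ]; then echo "A"; else echo "B"; fi
}
variant "user-42"
variant "user-42"   # same user, same variant, on every request
```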

Is this DevOps?

No. The DDS stack is about the tools for working in a highly distributed environment. These tools are frequently used by people in the DevOps movement, but it’s important not to mistake the tools for the substance. DevOps is about the culture of software development, starting with developers and operations staff, but in a larger sense, across companies as a whole. Perhaps the best statement of that is Velocity speaker Jeff Sussna’s (@jeffsussna) post Empathy: The Essence of DevOps.

Most broadly, DevOps is about the realization that software development is a business process, all businesses are software businesses, and all businesses are ultimately human enterprises. To mistake the tools for the cultural change is the essence of cargo culting.

The CIO of Fidelity Investments once remarked to Tim O’Reilly: “We know about all the latest software development tools. What we don’t know is how to organize ourselves to use them.” DevOps is part of the answer to that business question: how should the modern enterprise be organized to take advantage of the way software systems work now? But it’s not just integration of development and IT operations. It’s also integration of development and marketing, business modeling and measurement, and, in a public sector context, policy making and implementation.

Why now?

All software is “web software,” even the software that doesn’t look like web software. We’ve become used to gigantic web applications running across millions of servers; Google and Facebook are in the forefront of our consciousness. But the web has penetrated to surprising places. You might not think of enterprise applications as “web software,” but it’s increasingly common for internal enterprise applications to have a web interface. The fact that it’s all behind a firewall is irrelevant.

Likewise, we’ve heard many times that mobile is the future, and the web is dead. Maybe, if “the web” means Firefox and Chrome. But the first time the web died, Nat Torkington (@gnat) said: “I’ve heard that the Web is dead. But all the applications that have killed it are accessing services using HTTP over port 80.” A small number of relatively uninteresting mobile applications are truly standalone, but most of them are accessing data services. And those services are web services; they’re using HTTP, running on Apache, and pushing JSON documents around. Dead or not, the web has won.

The web has done more than win, though. The web has forced all applications to become distributed. Our model is no longer Microsoft Word, Adobe InDesign, or even the original VMware. We’re no longer talking about products in shrink-wrapped boxes, or even enterprise software delivered in massive deployments; we’re talking about products like Gmail and Netflix that are updated and delivered in real time from thousands of servers. These products rely on services that aren’t under the developer’s control, they run on servers that are spread across many data centers on all continents, and they run on a dizzying variety of platforms.

The future of software development is bound up with distributed systems, and all the complexity and indeterminacy that entails. We’ve started to develop the tools necessary to make distributed systems tractable. If you’re part of a software development or operations team, you need to know about them.

Photo: Cairn at Garvera, Surselva, Graubuenden, Switzerland. Licensed under Wikimedia Commons
