Continuous deployment in 5 easy steps

One of the lean startup techniques I’ll be discussing at this week’s session at the Web 2.0 Expo is called continuous deployment. It’s a process whereby all code written for an application is immediately deployed into production. The result is a dramatic reduction in cycle time and a freeing up of individual initiative. It has enabled companies I’ve worked with to deploy new code to production as often as fifty times every day.

Continuous deployment is controversial. Most people, when they first hear about continuous deployment, think I’m advocating low-quality code or an undisciplined cowboy-coding development process. On the contrary, I believe that continuous deployment requires tremendous discipline and can greatly enhance software quality, by applying a rigorous set of standards to every change to prevent regressions, outages, or harm to key business metrics. (This criticism is a variation of the “time, quality, money – pick two” fallacy.)

Another common reaction I hear to continuous deployment is that it’s too complicated, time-consuming, or hard to prioritize. It’s this latter fear that I’d like to address head-on in this post. While it is true that the full system we use to support deploying fifty times a day at IMVU is elaborate, it certainly didn’t start that way. By making a few simple investments and process changes, any development team can be on their way to continuous deployment. It’s the journey, not the destination, that counts. Here’s the why and how, in five steps.

  1. Continuous integration server. This is the backbone of continuous deployment. We need a centralized place where all automated tests (unit tests, functional tests, integration tests, everything) can be run and monitored upon every commit. There are many fine free software tools to make this easy – I have had success with BuildBot. Whatever tool you use, it’s important that it be able to run all the tests your organization writes, in all languages and frameworks.

    If you only have a few tests (or even none at all), don’t despair. Simply set up the CI server and agree to one simple rule: we’ll add a new automated test every time we fix a bug. Following that rule will immediately start putting testing where it’s needed most: in the parts of your code that have the most bugs and, therefore, drive the most waste for your developers. Even better, these tests will start to pay immediate dividends by propping up that most-unstable code and freeing up a lot of time that used to be devoted to finding and fixing regressions (aka “firefighting”).

    If you already have a lot of tests, make sure that a full run on the CI server completes quickly – 10-30 minutes at the maximum. If that’s not possible, simply partition the tests across multiple machines until you get the time down to something reasonable. (A toy sketch of the commit-triggered test loop a CI server runs appears after this list.)

    For more on the nuts-and-bolts of setting up continuous integration, see Continuous integration step-by-step.

  2. Source control commit check. The next piece of infrastructure we need is a source control server with a commit-check script. I’ve seen this implemented with CVS, Subversion, or Perforce, and have no reason to believe it can’t be done in any other source control system. The most important thing is that you have the opportunity to run custom code at the moment a new commit is submitted but before it is accepted by the server. Your script should have the power to reject a change and report a message back to the person attempting to check in. This is a very handy place to enforce coding standards, especially those of the mechanical variety.

    But its role in continuous deployment is much more important. This is the place where you can control what I like to call “the production line,” to borrow a metaphor from manufacturing. When something goes wrong anywhere along the line, this script should halt new commits. So if the CI server runs a build and even one test breaks, the commit script should prohibit new code from being added to the repository. In subsequent steps, we’ll add additional rules that also “stop the line,” and therefore halt new commits. (A sketch of such a pre-commit hook appears after this list.)

    This sets up the first important feedback loop that you need for continuous deployment. Our goal as a team is to work as fast as we can reliably produce high-quality code – and no faster. Going any “faster” is actually just creating delayed waste that will slow us down later. This feedback loop is also discussed in detail elsewhere.

  3. Simple deployment script. At IMVU, we built a serious deployment script that incrementally deploys software machine-by-machine and monitors the health of the cluster and the business along the way so that it can do a fast-revert if something looks amiss. We call it a cluster immune system. But we didn’t start out that way. In fact, attempting to build a complex deployment system like that from scratch is a bad idea.

    Instead, start simple. It’s not even important that you have an automated process, although as you practice you will automate more of it over time. Rather, it’s important that you do every deployment the same way and have a clear and published process for how to do it that you can evolve over time.

    For most websites, I recommend starting with a simple script that just rsyncs code to a version-specific directory on each target machine. If you are facile with unix symlinks, you can pretty easily set this up so that advancing to a new version (and, hence, rolling back) is as easy as switching a single symlink on each server. But even if that’s not appropriate for your setup, have a single script that does a deployment directly from source control.

    When you want to push new code to production, require that everyone use this one mechanism. Keep it manual, but simple, so that everyone knows how to use it. And, most importantly, have it obey the same “production line” halting rules as the commit script. That is, make it impossible to do a deployment for a given revision if the CI server hasn’t yet run and had all tests pass for that revision. (A sketch of such a deployment script appears after this list.)

  4. Real-time alerting. No matter how good your deployment process is, bugs can still get through. The most annoying are bugs that don’t manifest until hours or days after the code that caused them is deployed. To catch those nasty bugs, you need a monitoring platform that can let you know when things have gone awry and get a human being involved in debugging them.

    To start, I recommend a system like the open source Nagios. Out of the box, it can monitor basic system stats like load average and disk utilization. For continuous deployment purposes, we want it to also monitor business metrics like simultaneous users or revenue per unit time. At the beginning, simply pick one or two of these metrics to use. Anything is fine to start, and it’s important not to choose too many. The goal should be to wire the Nagios alerts up to a pager, cell phone, or high-priority email list that will wake someone up in the middle of the night if one of these metrics goes out of bounds. If the pager goes off too often, it won’t get the attention it deserves, so start simple. (A sketch of a Nagios-style check for one such metric appears after this list.)

    Follow this simple rule: every time the pager goes off, halt the production line (which will prevent checkins and deployments). Fix the urgent problem, and don’t resume the production line until you’ve had a chance to schedule a five whys meeting for root cause analysis, which we’ll discuss next.

  5. Root cause analysis (five whys). So far, we’ve talked about making modest investments in tools and infrastructure and adding a couple of simple rules to our development process. Most teams should be able to do everything we’ve talked about in a week or two, at the most, since most of the work is installing and configuring off-the-shelf software.

    Five whys is not something you can get in a box. It’s a powerful practice, and it is the motive force that will drive major improvements in your development process incrementally, one step at a time. I’ve described it in detail in my post Five Whys, and will only summarize it here.

    The idea is to always get to the root cause of any unexpected failure in the system. A test failing, a nagios alert firing, or a customer seeing a new bug are all sufficient triggers for root cause analysis. That’s why we always shut down the production line for problems of this kind – it signals the need for root cause analysis and also creates the time and space for it to happen (since it deliberately slows the whole team down).

    Five whys gets its name from the process of asking “why” recursively to uncover the true source of a given problem. It enables continuous deployment when you add this rule: every time you do a root cause analysis, make a proportional investment in prevention at each of the five levels you uncover. Proportional means that the solution shouldn’t be more expensive than the problem you’re analyzing; a minor inconvenience for only a few customers should merit a much smaller investment than a multi-hour outage.

    But no matter how small the problem, always make some investment, and always make it at each level. Since our focus in this post is deployment, that means always asking the question “why was this problem not caught earlier in our deployment pipeline?” So if a customer experienced a bug, why didn’t Nagios alert us? Why didn’t our deployment process catch it? Why didn’t our continuous integration server catch it? For each question, make a small improvement.

    Over months and years, these small improvements add up, much like compounding interest. But there is a reason this approach is superior to making a large up-front investment in a complex continuous deployment system modeled on IMVU’s (or anyone else’s). The payoff is that your process will be uniquely adapted to your particular system and circumstances. If most of your headaches come from performance problems in production, then you’ll naturally be forced to invest in prevention at the deployment/alerting stage. If your problems stem from badly factored code, which causes collateral damage for even small features or fixes, you’ll naturally find yourself adding a lot of automated tests to your CI server. Each problem drives investments in that category of solution. Thankfully there’s an 80/20 rule at work: 20% of your code and architecture probably drives 80% of your headaches. Investing in that 20% frees up incredible time and energy that can be invested in more productive things.
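
To make step 1 concrete, here is a toy, self-contained sketch of the loop a CI server runs: notice each new commit, run every test suite against it, and record whether the build is green. It is only an illustration, not a substitute for a real tool like BuildBot; the repository URL, the `make test` entry point, and the status-file path are assumptions invented for this example.

```python
#!/usr/bin/env python
"""Toy stand-in for a CI server: poll source control, run every test
suite on each new revision, and record the last known-good revision.
The repository URL, test command, and file paths are assumptions."""
import subprocess
import time

REPO = "http://svn.example.com/myapp/trunk"    # hypothetical repository
STATUS_FILE = "/var/ci/last_good_revision"     # read later by the deploy script

def head_revision():
    """Ask Subversion for the newest revision of the trunk."""
    out = subprocess.run(["svn", "info", "--show-item", "revision", REPO],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def run_all_tests(revision):
    """Check out the revision and run unit, functional, and integration
    tests through one entry point; exit code zero means the build is green."""
    subprocess.run(["svn", "checkout", "-r", revision, REPO, "build"], check=True)
    return subprocess.run(["make", "-C", "build", "test"]).returncode == 0

last_seen = None
while True:
    rev = head_revision()
    if rev != last_seen:
        last_seen = rev
        if run_all_tests(rev):
            with open(STATUS_FILE, "w") as f:
                f.write(rev)                   # green: this revision may be deployed
        else:
            print(f"Tests failed at r{rev} -- stop the line")
    time.sleep(60)                             # poll once a minute
```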
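
For step 2, here is a minimal sketch of a Subversion pre-commit hook that enforces the “production line” rule. The halt-flag path and the [fixes-build] override marker are assumptions, not a description of IMVU’s actual script; the override mirrors the structured commit-message idea discussed in the comments.

```python
#!/usr/bin/env python
"""Sketch of a Subversion pre-commit hook enforcing the production-line
rule.  The halt-flag path and the "[fixes-build]" marker are assumptions."""
import os
import subprocess
import sys

HALT_FLAG = "/var/ci/production_line_halted"   # created when a test or alert fails

def commit_message(repo, txn):
    """Read the log message of the incoming (not yet accepted) commit."""
    out = subprocess.run(["svnlook", "log", "-t", txn, repo],
                         capture_output=True, text=True, check=True)
    return out.stdout

def main():
    repo, txn = sys.argv[1], sys.argv[2]       # arguments Subversion passes to the hook
    if os.path.exists(HALT_FLAG) and "[fixes-build]" not in commit_message(repo, txn):
        sys.stderr.write(
            "Production line is halted (broken build or alert firing).\n"
            "Only commits tagged [fixes-build] are accepted right now.\n")
        return 1                               # non-zero exit rejects the commit
    # Mechanical coding-standard checks would also go here.
    return 0

if __name__ == "__main__":
    sys.exit(main())
```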
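
For step 3, a sketch of the simplest possible deployment script: refuse to push any revision the CI server hasn’t blessed, rsync the code into a version-specific directory on each web server, then advance a current symlink. Rolling back means pointing the symlink at the previous release. Host names, paths, and the status file are illustrative assumptions.

```python
#!/usr/bin/env python
"""Sketch of the simplest deployment script: deploy only CI-blessed
revisions, rsync into a versioned directory, flip a 'current' symlink.
Host names, paths, and the status file are illustrative assumptions."""
import subprocess
import sys

SERVERS = ["web1.example.com", "web2.example.com"]   # hypothetical hosts
LAST_GOOD = "/var/ci/last_good_revision"             # written by the CI loop above

def deploy(revision):
    if revision != open(LAST_GOOD).read().strip():
        sys.exit(f"r{revision} has not passed CI -- deployment refused")
    for host in SERVERS:
        target = f"/srv/myapp/releases/r{revision}/"
        # Copy the tree, then advance the symlink; rolling back means
        # pointing 'current' at the previous release directory.
        subprocess.run(["rsync", "-a", "--delete", "build/", f"{host}:{target}"],
                       check=True)
        subprocess.run(["ssh", host, f"ln -sfn {target} /srv/myapp/current"],
                       check=True)
    print(f"Deployed r{revision} to {len(SERVERS)} servers")

if __name__ == "__main__":
    deploy(sys.argv[1])
```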
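
For step 4, a sketch of a Nagios-style check for a single business metric. It follows the standard Nagios plugin convention of exit codes 0, 1, and 2 for OK, WARNING, and CRITICAL; the metric endpoint and thresholds are invented for illustration. In a real setup, a CRITICAL result would page someone and set the same halt flag the commit hook checks, stopping the production line.

```python
#!/usr/bin/env python
"""Sketch of a Nagios-style check for one business metric.  Exit codes
follow the plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
The metric endpoint and thresholds are invented for illustration."""
import json
import sys
import urllib.request

METRIC_URL = "http://metrics.internal/simultaneous_users"   # hypothetical endpoint
WARN, CRIT = 900, 700          # alert when concurrent users drop below these

def main():
    with urllib.request.urlopen(METRIC_URL, timeout=10) as resp:
        users = json.load(resp)["value"]
    if users < CRIT:
        print(f"CRITICAL - {users} simultaneous users")
        return 2
    if users < WARN:
        print(f"WARNING - {users} simultaneous users")
        return 1
    print(f"OK - {users} simultaneous users")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```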

Following these five steps will not give you continuous deployment overnight. In its initial stages, most of your root cause analysis will come back to the same problem: “we haven’t invested in preventing that yet.” But with patience and hard work, anyone can use these techniques to inexorably drive waste out of their development process.

Continuous deployment is only one tool in the lean startup arsenal – to learn more about it and to see how it relates to other techniques, join us for “The Lean Startup: a Disciplined Approach to Imagining, Designing, and Building New Products” on April 1st at Web 2.0 Expo. And thanks to web2open, you can also join us for an in-depth discussion immediately afterward. Best of all, you can attend both events for free (click here for details).

  • http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Deployment Steve Loughran

    I’ve played a lot with Continuous Deployment infrastructure, and as you say, automation is essential. I’m not sure I’d push unix/perl scripts, though; they don’t scale and aren’t that robust unless well written. Better to use -any- Configuration Management tool, from CfEngine to bcfg2, Puppet or SmartFrog. There is time spent up front learning to use them, but they pay off long term.

    Another thing to consider is that if your cluster is outsourced to some cloud computing datacentre, you can do live switchover/rollback by bringing up the staging version as a new set of VMs, testing it, and then, if you are happy, flipping the public IP address/hostname to point to the new version. And flipping back to the old one if there is trouble. This is still pretty leading-edge, but it could be very powerful.

  • Scott Carlson

    With the commit-check, how do you fix a broken build? Do you force a check for a specific tag in the commit message?

  • http://startuplessonslearned.blogspot.com/ Eric Ries

    Scott, yes that’s the basic idea. I normally use a structured comment in the commit message, so that it is automatically published to everyone who subscribes to changes. Alternatives are using the source control system’s built-in metadata or allowing checkins from the CI server itself. It’s also good to have a manual override, as long as using it triggers some kind of alert to the rest of the team (it might be time for some root cause analysis).

    Steve, I actually don’t recommend automation or anything fancy here. I do like CF for configuration changes – but generally not for deployment. The important point is that the deployment script is owned by the programming team, not ops, QA or some other group. As long as they feel they can understand and modify the script as they realize it needs new features, it’s OK to build it with any tool available.

  • http://Nidus.ORG Cannon

    If ideas are currency, then text is king and vi is plumbing. What could be faster?

  • http://ronaldbradford.com/blog Ronald Bradford

    With code deployment you can look into decreasing the time to market of new features; however, when a deployment introduces either additional load or backend database changes, you introduce external factors that have infrastructure impact.

    For example, additional load, such as increased reading or writing of data, means more database impact, more network impact, and longer page load times. These may cause greater bottlenecks to occur.

    If you require database changes and you run the world’s most popular database, MySQL, then deployment of features may require downtime, limited functionality being available, or degraded performance.

    In these instances, planning, scheduling, and often a combined release of a number of features are more effective.

    http://ronaldbradford.com – MySQL Expert | MySQL Performance Tuning | MySQL Scalability

  • http://jts-blog.com JT

    Interesting approach. I like it. But it sounds like you should not underestimate the need for discipline. If your team barely understands the use of Subversion, and is calling automated builds w/o *any* unit tests “CI Builds”, then you might have a long road ahead!

    How does change control fit in? I’m coming from the financial industry. With SOX, PCI, SAS 70, etc., we need to have some type of process in place for the organization to approve changes.

  • http://www.alevin.com Adina Levin

    What does this do for the end-user experience? Is this used anywhere for end-user features, and does the UI / user behavior change incrementally with each push?

  • http://timwolters.blogspot.com Tim Wolters

    First of all, great article on getting up and running and iterating quickly.

    In #3 this obviously doesn’t include DB changes. The best environment I’ve seen for this is Rails migrations. So your script would double in size to two lines:

    1. change symlink
    2. rake db:migrate VERSION=prev_version_num

    You could probably use this even in a non-RoR environment; your DB would just need to adhere to Rails naming conventions.

    And, in addition to the automated monitoring (very important!), acknowledge that your customers are doing collective monitoring. Give them extremely easy ways to interact with you when they see trouble (Twitter integration, email, forums). And quickly turn these issues around to continually impress and befriend your customers! Loyal customers give you a lot more leniency when it comes to continuous deployment.

  • http://startuplessonslearned.blogspot.com/ Eric Ries

    Thanks for the great suggestions. I especially like the Rails idea.

    DB changes and other “one-way” events that can’t be easily rolled back need a special process for handling them. However, if you follow these five steps, you will quickly discover these exceptions and develop these processes.

    One thing I didn’t mention in the original article is that continuous deployment allows you to separate two meanings of “release”: the technical sense, in which code is fully integrated and deployed, and the marketing sense, in which it’s something customers see.

    You can develop systems to control which customers see which features. Then you can deploy in real time without worry, and enable features for customers only when it is optimal for your business – regulatory issues, opt-in procedures, split-tests, etc.

  • Mark

    I cannot stress enough the importance of CruiseControl. Once you get CruiseControl up and running, and tweaked to your desired settings, it becomes very, very easy to take care of any deployments.

  • http://westkarana.com Brenda Holloway

    While this is cool for automating builds, I’m troubled by this being an automated route to pushing code through to production. Just because some code compiles and runs doesn’t mean the code should have been written in the first place. Without some sort of review, there’s a huge risk that unexamined code could hit production. This is a particular risk if your company does all the SOX and CMMI stuff, where every code push requires a justification, a paper trail, and an independent audit.

    While this methodology may work for games, I can’t see this working well with applications that could potentially put real people at real risk.

  • http://jts-blog.com JT

    @Mark – I recommend taking a look at Bamboo too. About a year ago we switched over from CC. I’m very happy with Bamboo. The integration with Jira/Fisheye is well done. Atlassian has a nice plugin for Intellij/Eclipse.

  • http://anarchogeek.com rabble

    We did something very similar at odeo.com, and the only time we had serious problems with deploys was when I went on vacation for 2 weeks and there wasn’t a single deploy. Normally we had many deploys a day and were quite stable.

  • joe

    I’ve never learned how to properly debug, so I’m constantly launching upgrades that come back to haunt me. It’s all beta until the core functions are sure and solid, then it’s periodic surprises.

    This isn’t for corporations or clients paying boatloads, though, no no. Web services are a joy and a prison simultaneously. I’m all for continuous development and someday will implement an actual plan; till then I guess I’m just a cowboy trying to lasso all the bugs before branding time.

    One tactic I made up is to throw a script up on the live site but only include it if the user is logged in as admin or testuser. Then I go in and try it out a bunch before sharing it with the other users.

    All the tools above seem daunting to me. Is there anything I can just plug into a PHP template header or footer somewhere that will notify me when a script fails, and send over all the vars involved at the time?

    Great article – it’s an important issue.

  • http://orip.org Ori Peleg

    Thanks for the great article, Eric!

    I wonder how much of this varies across project types – sometimes we don’t find a way to make the test suites that complete, and find that a manual QA buffer is necessary.

    You’ve given me some ideas, e.g. I’ve never hooked accepting commits to passing the test suites before.

  • http://taint.org/ Justin Mason

    C-D is a great methodology — please keep describing the details of IMVU’s setup!

    @Brenda Holloway:

    ‘without some sort of review, there’s a huge risk that unexamined code could hit production.’

    So you introduce a code review step into the commit workflow: code needs to be reviewed before it can even make it into trunk. (This is how the ASF handles maintenance branches — “Review Then Commit” is the phrase.)

    Also — I’m not sure about halting commits on CI build failure. I can see that resulting in developers “going dark” and doing their work offline, or forking “dev branches” where they can check in code without fear of breaking trunk. It blocks “commit early, commit often”: http://blog.red-bean.com/sussman/?p=96

    Perhaps the best option is a “release branch” which is used as the source of CI builds, and is halted on CI failure?

  • http://www.sqablogs.com/jstrazzere/ Joe Strazzere

    Is deploying to production fifty times per day a good thing?

    If so, why?

    Do the users get confused as the system under them changes every 30 minutes?

  • Farid Sharim

    Just file deployment is kinda easy to handle. Taking care of DB changes (DB connect, execute SQL, maybe undo SQL script) is way different!
    This should be
    Any good ideas/points on handling such environments?

  • http://stochasticresonance.wordpress.com Andrew Clay Shafer

    Eric, thanks for sharing your strategies and success.

    I consider this interpretation of ‘Continuous Integration’ a subset of my definition that is only appropriate for certain scenarios.

    In my opinion, the goal is to lower the fixed cost of deploying the application, which includes the fixed cost of qualifying the application, as much as possible.

    IMVU has decided that qualifying the application means only running automated tests. IMVU has also been criticized for this. I’m not criticizing, because for this type of application the worst-case scenario is that someone has a slightly unexpected experience while chatting with other people’s virtual avatars. Just realize this is a business decision that might not be appropriate for other types of applications.

    Understanding what the real costs of deployment are for your organization and the technology and processes you can apply to lower that cost provides a lot of power and flexibility.

    With that understanding, you can make decisions about what is appropriate for your application to deliver the most value. That might mean deploying features on checkin, deploying on QA’s checkoff, or in some situations it might align best with the business to queue up features to coordinate with sales and marketing efforts.

    Not every application is a social consumer application, and not everyone can get away with issues like twitter has had. There is no one size fits all strategy, but knowing what is possible gives you options.

  • http://stochasticresonance.wordpress.com Andrew Clay Shafer

    I meant ‘Continuous Deployment’ in that first sentence.

    By the way Eric, great interview with Nivi on Venture Hacks last week.

  • Ever

    Thank you for your input.

    I am all for Continuous Integration. However, I would like to understand why it would be required to “deploy new code to production as often as fifty times every day.” That’s a total of 250 deployments in one working week!

    If the quality of the code is expected to increase by implementing the suggestions provided in your article, wouldn’t the number of deployments needed to fix bugs decrease?

    Wouldn’t it be easier to implement a combination of Agile practices and Test Driven Development to ensure that proper testing is performed prior to deployment (unit testing, QA, UAT), as opposed to deploying code to prod, identifying what is broken, building automated testing, deploying again, and continuing this cycle until all the bugs are fixed?

    Nevertheless, this is my opinion based on my experience. Thank you so much for sharing your ideas. Definitely a very interesting concept.

  • John Tangney

    We’re experimenting with CD. The hardest shift in thinking is the knowledge that what I check in will be live within a few minutes. It means that work-in-progress has to be considered very carefully: It must be 100% compatible with the live system.

    Just getting to CD has been a great journey for me. We have had to focus much harder on automation than we ever had to with Continuous Integration.

  • Geoff

    I’m not sure “easy” should be in this article’s title. 5 easy steps these are not.

  • mypalmike

    For changes in database schema, you might run your DB backup script as one step in deployment (you do have a DB backup script on cron, don’t you?) or a similar scheme. This helps with rolling back, as well as with any catastrophic data loss from new bugs.

    Or if you use a schema-less DB such as Neo4j, these types of changes become coding issues rather than external system management.