Operations is a competitive advantage… (Secret Sauce for Startups!)

My lunchtime conversations at the Summit centered around Operations as a competitive advantage (and occasionally a “strategic weapon”). This advantage is the ability to consistently create and deploy reliable software to an unreliable platform that scales horizontally.

Many people think of Operations as “a bunch of boring work… which I’m hoping someone else is doing.” Setting up a development environment often takes less time than building the tools and infrastructure needed to test, deploy, monitor, and scale new software. The survival of most projects depends on working software, at least initially, so if there is money or time, many people will spend it on development. Unfortunately, people say they will “figure that ops stuff out soon”, but what they mean is “when we’re totally screwed!!!” It doesn’t have to be that way…


The example above is the tale of two Web 2.0 startups scaling to 20 systems during their first three months. The first team starts writing software and installing systems as they go, waiting to deal with the “ops stuff” until they have an “ops person”. The second team dedicates someone to infrastructure for the first few weeks and ramps up from there. They won’t need to hire an “ops person” for a long time and can focus on building great technology.

In my experience it takes about 80 hours to bootstrap a startup. This generally means installing and configuring an automated infrastructure management system (Puppet), a version control system (Subversion), continuous build and test (frequently CruiseControl.rb), software deployment (Capistrano), and monitoring (I’m currently evaluating Hyperic, Zenoss, and Groundwork). Once this is done, the “install time” for each new system is reduced to nearly zero and requires no specialized knowledge. This is the first ingredient in “Operations Secret Sauce”.
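As a rough back-of-the-envelope sketch of why the up-front investment pays off, the snippet below estimates how many systems it takes before 80 hours of bootstrapping costs no more than installing each machine by hand. The 80 hours comes from the paragraph above; the 4 hours per manual install is purely a hypothetical figure for illustration, and the automated per-system install time is treated as zero.

```python
import math

BOOTSTRAP_HOURS = 80         # from the post: one-time automation setup
MANUAL_HOURS_PER_SYSTEM = 4  # hypothetical: time to hand-install one server


def break_even_systems(bootstrap_hours, manual_hours_per_system):
    """Smallest system count at which automating costs no more than manual installs.

    Assumes the automated per-system install time is effectively zero.
    """
    return math.ceil(bootstrap_hours / manual_hours_per_system)


print(break_even_systems(BOOTSTRAP_HOURS, MANUAL_HOURS_PER_SYSTEM))
```

Plug in your own estimate for a manual install; the slower and more error-prone the manual process, the sooner the automation pays for itself.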

This kind of scalability becomes really interesting when you find yourself suddenly popular, as iLike did when it launched its Facebook app and had to scale up fast (Radar):

In our first 20 hours of opening doors we had 50,000 users sign up, and it is only accelerating. (10,000 users joined in the first 12 hrs. 10,000 more users in the next 3 hrs. 30,000 more users in the next 5 hrs!!)

We started the system not knowing what to expect, with only 2 servers, but ready with backup. Facebook’s rabid userbase chewed up our 2 servers almost instantly. We doubled our capacity to catch up. And then we doubled it again. And again. And again. Oh crap – we ran out of servers!! Although iLike.com has a very healthy level of Web traffic, and even though about half of all the servers in our datacenter were sitting unused, idle, as backup capacity, we are now completely maxed out.

We just emailed everybody we know across over a dozen Bay Area startups, corporations, and venture firms in a desperate plea to find spare servers so we can triple our capacity for the continued onslaught. Tomorrow we are picking up over 100 servers from different companies to have them installed just to handle the weekend’s traffic. (For those who responded to our late night pleas, thank you!)

Not being able to acquire hardware fast enough is by far a better problem than not being able to install it. iLike is something of a poster-child for puppet.
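To put the acceleration in perspective, here is a small sketch that turns the sign-up figures from the iLike quote above into hourly rates. The numbers are taken straight from the quote; the variable names are just illustrative.

```python
# Sign-up figures from the iLike quote: (new users, hours) for each window.
windows = [(10_000, 12), (10_000, 3), (30_000, 5)]

total_users = sum(users for users, _ in windows)
total_hours = sum(hours for _, hours in windows)
rates = [users / hours for users, hours in windows]

for (users, hours), rate in zip(windows, rates):
    print(f"{users:>6} users in {hours:>2} h = {rate:,.0f} users/hour")
print(f"total: {total_users} users in {total_hours} h")
```

The hourly sign-up rate grows roughly sevenfold between the first window and the last, which is why being able to install new servers quickly matters as much as being able to find them.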

Are any VCs out there including effective operations in their due diligence? Are startups incorporating this in their pitch? (Amazon seems to be pushing this as part of the AWS “Start-Up Project” if you’re using S3 and EC2.)

Update: Luke points out Adam Jacob’s post about implementing Puppet for iLike. (Disclosure: I’m discussing collaboration with Adam’s company, HJK solutions.)

Update #2: John Allspaw of Yahoo/Flickr fame has great commentary on procurement and capacity management challenges for successful startups.

Adam says:

Puppet enables us to get a huge jump-start on building automated, scalable, easy to manage infrastructures for our clients. Using Puppet, we:

  1. Automate as much of the routine systems administration tasks as possible.
  2. Get 10 minute unattended build times from bare metal, most of which is data transfer. Puppet takes it the rest of the way, getting the machines ready to have applications deployed on them. It’s down to two and a half minutes for Xen.
  3. Bootstrap our clients’ production environments while building their development environment. I can’t stress how cool this really is. Because we are expressing the infrastructure at a higher level, when it comes time to deploy your production systems, it’s really a non-event. We just roll out the Puppet Master and an Operating System auto-install environment, and it’s finished.
  4. Cross-pollinate between clients with similar architectures. We work with several different shops using Ruby on Rails, all of whom have very similar infrastructure needs. By using Puppet in all of them, when we solve a problem for one client, we’ve effectively solved it for the others. I love being able to tell a client that we solved a problem for them, and all it’s going to cost is the time it takes for us to add the recipe.

Sounds good to me.
