Leaving politics aside, there’s a lot that can be learned from the technical efforts of the Obama and Romney campaigns. Just about everyone agrees that the Obama campaign’s Narwhal project was a great success, and that Romney’s Orca was a failure. But why?
I have one very short answer. If you follow technology, you don’t have to read between the lines much to realize that Narwhal embodied the best of the DevOps movement: rapid iteration, minimal barriers between developers and operations staff, heavy use of “cloud” technology, and constant testing to prove that you can handle outages and heavy load. In contrast, Romney’s Orca was a traditional corporate IT project gone bad. The plans were developed by non-technical people (it’s not clear what technical expertise Romney had on his team), and implemented by consultants. There were token testing efforts, but as far as I could tell, no serious attempts to break or stress the system so the team understood how to keep it running when the going got tough.
It’s particularly important to look at two factors: the way testing was done, which I’ve already mentioned, and the use of cloud computing. While Orca was “tested,” there is a big difference between passing automated test suites and the sort of game day exercise that the Narwhal team performed several times. In a game day, you’re actively trying to break the system in real time: in this case, a fully deployed copy of the actual system. You unplug the routers; you shut down the clusters; you bombard the system with traffic at inconceivable volumes. And the people responsible for keeping the system up are the same people who built it and who run it in real life. If you read the Atlantic account of Narwhal’s game day, you’ll see that it involved everyone, from the most senior developers on down, sweating it out to keep the system running in the face of disaster. They even simulated what would happen if one of Amazon’s availability zones went down, as has happened several times in the past few years (and happened again a few days before the election). Game day gave the Obama team a detailed plan for just about every conceivable outage, and the confidence that, if something inconceivable happened, they could handle it.
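To make the idea of a game day concrete, here is a toy sketch in Python. It is not anything the Narwhal team ran: the zone names, the scenarios, and the `route` and `game_day` functions are all hypothetical, invented for illustration. It only shows the shape of the exercise: list the failures you can think of, inject each one, and record whether anything is left to serve traffic.

```python
def route(zones, down):
    """Return the first healthy zone in priority order, or None if all are down."""
    for zone in zones:
        if zone not in down:
            return zone
    return None


def game_day(zones, scenarios):
    """Inject each failure scenario and report which zone (if any) serves traffic."""
    return {name: route(zones, down) for name, down in scenarios.items()}


# Hypothetical setup: a primary zone with a hot backup.
zones = ["east", "west"]

# Hypothetical failure scenarios: each maps a name to the set of zones knocked out.
scenarios = {
    "normal": set(),
    "east outage": {"east"},
    "both down": {"east", "west"},
}

report = game_day(zones, scenarios)
for name, serving in report.items():
    print(f"{name}: {serving or 'NO SERVICE, needs a plan'}")
```

A real game day breaks a deployed copy of the system rather than a dictionary, but the output is the same kind of thing: a list of scenarios, each with either a working fallback or a gap that needs a plan.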
You never get the sense that Orca was tested in the same way. If it had been, the Romney team would have had a plan for what to do when network outages occurred, or when the server load went critical. I don’t see any evidence that the consultants who wrote the code were involved when operational problems started to show up. There were minimal plans for backup or disaster recovery, and on election day, they found out that the network couldn’t take the load. The Romney campaign had someone asking the right questions, but he didn’t get any answers.
Narwhal’s use of Amazon Web Services was another significant advantage. “In the cloud” is a cliche, but the capabilities Amazon provided were anything but. The Narwhal team didn’t have to worry about running out of compute capacity, because they could start more server instances as needed. Their disaster strategy included maintaining a hot backup of their applications in Amazon’s Western zone, in case the Eastern zone failed. Because they were relying on Amazon’s network services, network capacity wasn’t a concern. Amazon’s network handles Black Friday and Christmas with ease, along with many popular Internet sites. Election day wasn’t a challenge for their network.
While nobody knows exactly what the Orca team did, it’s believed that they were operating out of a single data center, either in Boston Garden or nearby, running on a fixed set of servers. This looks very much like a traditional IT operation, where the computing facilities are all owned and either on-premises or at a colocation facility. Arguably, this gives you increased control and security, though I believe that these advantages are a mirage. I’d much rather have Amazon’s staff fighting attempts to compromise their infrastructure than anyone I could afford to hire. The downside is that the Romney team wasn’t able to add capacity as load increased on election day. It’s just not possible to acquire and integrate new hardware on that time scale. Furthermore, by concentrating their servers at a single location, the Orca team effectively concentrated all their traffic, leading to outages when load exceeded capacity.
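The difference between a fixed set of servers and capacity you can add on demand is easy to see in a back-of-the-envelope model. The Python sketch below is purely illustrative: the traffic figures, the server counts, and the `dropped_requests` function are assumptions made up for the example, not data from either campaign, and it assumes new instances start instantly, which real ones don't.

```python
def dropped_requests(load_per_hour, servers, per_server_capacity, elastic=False):
    """Return the number of requests dropped in each hour.

    With elastic=False the server count never changes. With elastic=True the
    fleet grows (and, in this simple model, never shrinks) to cover the load.
    """
    dropped = []
    for load in load_per_hour:
        if elastic:
            # Ceiling division: enough instances to cover this hour's load.
            needed = -(-load // per_server_capacity)
            servers = max(servers, needed)
        capacity = servers * per_server_capacity
        dropped.append(max(0, load - capacity))
    return dropped


# Hypothetical election-day traffic, in requests per hour.
load = [500, 2000, 8000, 12000, 3000]

fixed = dropped_requests(load, servers=4, per_server_capacity=1000)
elastic = dropped_requests(load, servers=4, per_server_capacity=1000, elastic=True)

print("fixed fleet drops:  ", fixed)
print("elastic fleet drops:", elastic)
```

With four servers that each handle 1,000 requests an hour, the fixed fleet starts dropping traffic as soon as load passes 4,000, while the elastic fleet in this model drops nothing.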
I’ve seen studies claiming that 68% of IT projects fail, so the end result isn’t a big surprise. I don’t believe that Orca’s failure was the determining factor in the election. But it is a cautionary tale for anyone working in IT, whether at a web startup or a large, traditional IT shop. Separating developers from the operations staff responsible for running the system is inviting trouble. Consulting contracts that extend beyond development to deployment and operations aren’t unknown, but neither are they common. DevOps is impossible when the devs have met their contractual obligations and have left the building. Inadequate testing, particularly stress testing of the entire system, is a further step in the wrong direction. And while cloud computing in itself doesn’t prevent or forestall disaster, building an IT infrastructure that runs entirely on-premises denies you the flexibility you need to deal with the problems that will inevitably arise.
What will we see in the 2016 election? By then, DevOps may well sound staid and corporate, and we’ll be looking forward to the next trendy thing. But I can guarantee that the major campaigns in the next presidential election will have learned the lessons of 2012: take advantage of the cloud, don’t separate your development staff from operations, and Test All The Things. If you’re working in IT now, you don’t have to wait to put those lessons into practice.