The DevOps Audit Defense Toolkit tries to make a case to an auditor for Continuous Deployment in a regulated environment: that developers, following a consistent, disciplined process, can safely push changes out automatically to production once the changes pass all of the reviews and automated tests and checks in the CD pipeline.
Continuous Deployment has been made famous at places like Flickr, IMVU (where Eric Ries developed the ideas for the Lean Startup method), and Facebook:
Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and [are] applied directly to parts of the infrastructure. The idea is to quickly find issues and their impacts on the rest of the system and surely [fix] any bugs that would result from these frequent small changes.1
While organizations like Etsy and Wealthfront work hard to make Continuous Deployment safe, it is scary to auditors, to operations managers, and to CTOs like me who have been working in financial technology and understand the risks involved in making changes to a live, business-critical system.
Continuous Deployment requires you to shut down a running application on a server or a virtual machine, load new code, and restart. This isn’t that much of a concern for stateless web applications with pooled connections, where browser users aren’t likely to notice that they’ve been switched to a new environment in Blue-Green deployment.2 There are well-known, proven techniques and patterns for doing this that you can follow with confidence for this kind of situation.
But deploying changes continuously during the day at a stock exchange connected to hundreds of financial firms submitting thousands of orders every second and where response times are measured in microseconds isn’t practical. Dropping a stateful FIX session with a trading counterparty and reconnecting, or introducing any kind of temporary slowdown, will cause high-speed algorithmic trading engines to panic. Any orders that they have in the book will need to be canceled immediately, creating a noticeable effect on the market. This is not something that you want to happen ever, never mind several times in a day.
It is technically possible to do zero-downtime deployments even in an environment like this: decouple API connection and session management from the business logic, automatically deploy new code to a standby system, bring the standby and primary systems up together, synchronize in-memory state between them, trigger automated failover mechanisms to switch to the standby, and closely monitor everything as it happens to make sure that nothing goes wrong.
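The steps above can be sketched in a few lines of Python. This is a hypothetical illustration only — the names (TradingSystem, zero_downtime_deploy) are made up for this sketch, not taken from any real deployment tool:

```python
# Hypothetical sketch of standby-based zero-downtime deployment.
# All class and function names here are illustrative, not real APIs.

class TradingSystem:
    """Stands in for one instance of a trading system."""
    def __init__(self, version):
        self.version = version
        self.state = {}          # in-memory order book, sessions, etc.
        self.healthy = True

def zero_downtime_deploy(primary, new_version):
    # 1. Deploy the new code to a standby system and start it up.
    standby = TradingSystem(new_version)

    # 2. Synchronize in-memory state from primary to standby while
    #    the primary keeps handling traffic.
    standby.state = dict(primary.state)

    # 3. Check the standby before failing over; keep the primary
    #    active if anything looks wrong.
    if not standby.healthy or standby.state != primary.state:
        return primary

    # 4. Trigger failover: the standby becomes the active system.
    #    Because session management is decoupled from business logic,
    #    counterparty connections are not dropped during the switch.
    return standby

primary = TradingSystem("v1.0")
primary.state = {"orders": [101, 102], "sessions": ["FIX-A", "FIX-B"]}
active = zero_downtime_deploy(primary, "v1.1")
print(active.version)   # → v1.1
```

The monitoring step is only hinted at by the health check here; in practice it is the hardest part, which is exactly the cost-versus-benefit question raised below.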
But do the benefits of making small, continuous changes in production outweigh the risks and costs involved in making all of this work?
During trading hours, every part of every financial market system is expected to be up and responding consistently, all the time. But unlike consumer Internet apps, financial systems don’t need to run 24/7/365. This means that most financial institutions have maintenance windows where they can safely make changes. So why not continue to take advantage of this?
Some proponents of Continuous Deployment argue that if you don’t exercise your ability to continuously push changes out to production, you cannot be certain that it will work if you need to do it in an emergency. But you don’t need to deploy changes to production 10 or more times per day to have confidence in your release and deployment process. As long as you have automated and standardized your steps, and practiced them in test and exercised them in production, the risks of making a mistake will be low.
Another driver behind Continuous Deployment is that you can use it to run quick experiments, to try out ideas for new features or to evaluate alternatives through A/B testing. This is important if you’re an online consumer Internet startup. It’s not important if you’re running a stock exchange or a clearinghouse. While a retail bank may want to experiment with improvements to its consumer website’s look and feel, most changes to financial systems need forward planning and coordination, and advance notice — not just to operations, but to partners and customers, to compliance and legal, and often to regulators.
Changes to APIs and reporting specifications have to be certified with counterparties. Changes to trading rules and risk management controls need to be approved by regulators in advance. Even algorithmic trading firms that are constantly tuning their models based on live feedback need to go through a testing and certification process when they make changes to their code.
In order to minimize operational and technical risk, financial industry regulators are demanding more formal control over and transparency in changes to information systems, not less. New regulations like Reg SCI and MiFID II require firms to plan out and inform participants and regulators of changes in advance; to prove that sufficient testing and reviews have been completed before (and after) changes are made to production systems; and to demonstrate that management and compliance are aware of, understand, and approve of all changes.
It’s difficult to reconcile these requirements with Continuous Deployment — at least, for heavily regulated core financial transaction processing systems. This is why we focus on Continuous Delivery in this book, not Continuous Deployment.
Both approaches leverage an automated testing and deployment pipeline, with built-in auditing. With Continuous Delivery, changes are always ready to be deployed — which means that if you need to push a fix or patch out quickly and with confidence, you can. Continuous Delivery also provides a window to review, sign off on, and schedule changes before they go to production. This makes it easier for DevOps to work within ITIL change management and other governance frameworks, and to prove to regulators that the risk of change is being managed from the top down. Continuous Delivery puts control over system changes clearly into the hands of the business, not developers.
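The difference between the two approaches can be reduced to where the final gate sits. The sketch below is a hypothetical illustration of a Continuous Delivery release gate — the names (ReleaseCandidate, "change-advisory-board") are invented for this example, not from any real pipeline tool:

```python
# Hypothetical sketch of a Continuous Delivery gate: every change is
# deployable once the pipeline passes, but deployment waits for an
# explicit business sign-off. Names are illustrative only.

class ReleaseCandidate:
    def __init__(self, change_id):
        self.change_id = change_id
        self.tests_passed = False
        self.approved_by = None      # the review/sign-off window

def run_pipeline(candidate):
    # Automated build, test, and audit stage runs on every change.
    candidate.tests_passed = True    # stand-in for real test results
    return candidate

def deploy(candidate):
    if not candidate.tests_passed:
        raise RuntimeError("not deployable: pipeline failed")
    if candidate.approved_by is None:
        # Continuous Deployment would push here automatically;
        # Continuous Delivery stops and waits for sign-off.
        raise RuntimeError("ready to deploy, awaiting sign-off")
    return f"deployed {candidate.change_id}"

rc = run_pipeline(ReleaseCandidate("CHG-1042"))
try:
    deploy(rc)                       # blocked: no approval yet
except RuntimeError as err:
    print(err)                       # → ready to deploy, awaiting sign-off
rc.approved_by = "change-advisory-board"
print(deploy(rc))                    # → deployed CHG-1042
```

The sign-off attribute is the sketch's stand-in for the review-and-schedule window described above: it is what puts control over the change into the hands of the business rather than the developer who pushed the commit.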
In Blue-Green deployment, you run two production environments (“blue” and “green”). The blue environment is active. After changes are rolled out to the green environment, customer traffic is rerouted using load balancing from the blue to the green environment. Now the blue environment is available for updating.
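The switch described in this footnote amounts to flipping an active pointer between two environments. A minimal sketch, with invented names (BlueGreenRouter) rather than any specific load balancer's API:

```python
# Hypothetical sketch of the Blue-Green switch: a router holds two
# environments and flips which one receives customer traffic.

class BlueGreenRouter:
    def __init__(self, blue_version, green_version):
        self.environments = {"blue": blue_version, "green": green_version}
        self.active = "blue"            # blue serves traffic initially

    def route(self):
        """Return the version currently receiving traffic."""
        return self.environments[self.active]

    def deploy(self, new_version):
        """Roll changes out to the idle environment, then switch."""
        idle = "green" if self.active == "blue" else "blue"
        self.environments[idle] = new_version   # update the idle side
        self.active = idle                      # reroute traffic to it
        # The old environment is now idle, available for the next update.

router = BlueGreenRouter(blue_version="v1", green_version="v1")
router.deploy("v2")
print(router.route())   # → v2, served from green; blue is now idle
```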