Velocity 2013 Speaker Series
Failure Isolation and Operations with Hystrix
Web-scale applications such as Netflix serve millions of customers using thousands of servers across multiple data centers. Unmitigated system failures can impact the user experience, a product’s image, and a company’s brand and, potentially, revenue. Service-oriented architectures such as these are too complex to completely understand or control and must be treated accordingly. The relationships between nodes are constantly changing as actors within the system independently evolve. Failure in the form of errors and latency will emerge from these relationships and resilient systems can easily “drift” into states of vulnerability. Infrastructure alone cannot be relied upon to achieve resilience. Application instances, as components of a complex system, must isolate failure and constantly audit for change.
At Netflix, we have spent a lot of time and energy engineering resilience into our systems. Among the tools we have built is Hystrix, which specifically focuses on failure isolation and graceful degradation. It evolved from a series of production incidents involving saturated connection and/or thread pools, cascading failures, and misconfigurations of pools, queues, timeouts, and other such “minor mistakes” that led to major user impact.
This open source library follows these principles in protecting our systems when novel failures inevitably occur:
- Isolate client network interaction using the bulkhead and circuit breaker patterns.
- Fallback and degrade gracefully when possible.
- Fail fast when fallbacks aren’t available and rapidly recover.
- Monitor, alert and push configuration changes with low latency (seconds).
Restricting concurrent access to a given backend service has proven to be an effective form of bulkheading, as it limits the resource utilization to a concurrent request limit smaller than the total resources available in an application instance. We do this using two techniques: thread pools and semaphores. Both provide the essential quality of restricting concurrent access while threads provide the added benefit of timeouts so the caller can “walk away” if the underlying work is latent.
Isolating functionality rather than the transport layer is valuable as it not only extends the bulkhead beyond network failures and latency, but also those caused by client code. Examples include request validation logic, conditional routing to different or multiple backends, request serialization, response deserialization, response validation, and decoration. Network responses can be latent, corrupted, or incompatibly changed at any time, which in turn can result in unexpected failures in this application logic.