Websites go down. It happens. But in many cases it might be possible to deal with and explain a failure while keeping user frustration to a minimum.
Mike Brittain (@mikebrittain), director of engineering at Etsy, addressed the resilient user experience in our recent interview. Among his insights from the full interview (below):
- Designing an experience that can adapt to individual service failures and partial degradations requires an intermingling between software engineers, operations teams and product and design teams.
- Previous experience designing for cable-connected devices may skew our connectivity expectations when it comes to more fragile mobile networks.
Our full interview follows.
What is a “resilient” user experience — and what are a few of the main practices involved in ensuring an acceptable UX during an outage?
Mike Brittain: Resilient user experiences are adaptable to individual failure modes within the system — allowing users to continue to use the service even in a partially degraded scenario.
Large-scale websites are driven by numerous databases, APIs, and other back-end services. Without thoughtful application design, any failure in an individual service might bubble up as a generic “Server Error.” This sort of response completely blocks the user from any further experience and has the potential to degrade the user’s confidence in your website, software or brand.
Consider an article page on the New York Times’ website. There is the primary content of the page: the body of the article itself. And then there are all sorts of ancillary content and modules, such as social sharing tools, personalization details if you’re signed-in, comments, articles recommended for you, most emailed articles, advertisements, etc. If something were to go wrong while retrieving the primary content for the page — the article body — you might not be able to provide anything meaningful to the reader. But if one or more services failed for generating any of those ancillary modules, it’s likely to have a much lower impact on the reader. So, a resilient design would allow for any of those individual modules to fail gracefully without blocking the reader from completing the primary action on the site — reading news.
Here’s another example closer to my own heart: The primary action for visitors to Etsy is to find, review, and purchase handcrafted goods. A product page on Etsy.com includes all sorts of ancillary information and tools, including a mechanism for marking a product as a “favorite.” If the Favorites system goes down, we wouldn’t want to return an error page to the visitor. Instead, we would hide the tool altogether. Meanwhile, visitors can continue to find and purchase products during this degradation. In fact, many of them may be blissfully unaware that the feature even exists while it is unavailable.
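A minimal sketch of this idea in Python, assuming a server-side script that assembles a page from independent modules. The function names and the sample listing are hypothetical illustrations, not Etsy's code: the primary content is allowed to fail the page, while any ancillary module that fails is simply left out.

```python
from typing import Callable, Dict, Optional


def render_module(fetch: Callable[[], str]) -> Optional[str]:
    """Return the module's HTML, or None if its back-end call fails."""
    try:
        return fetch()
    except Exception:
        # Ancillary failure: hide the module instead of failing the page.
        return None


def render_product_page(fetch_listing: Callable[[], str],
                        ancillary: Dict[str, Callable[[], str]]) -> str:
    # Primary content: if this raises, there is nothing meaningful to show,
    # so the error is allowed to propagate.
    parts = [fetch_listing()]
    for name, fetch in ancillary.items():
        html = render_module(fetch)
        if html is not None:
            parts.append(html)
    return "\n".join(parts)


def broken_favorites() -> str:
    # Hypothetical stand-in for a Favorites service outage.
    raise ConnectionError("favorites service is down")


page = render_product_page(
    lambda: "<h1>Hand-knit scarf</h1>",
    {"favorites": broken_favorites,
     "reviews": lambda: "<div>12 reviews</div>"},
)
print(page)
```

The page still renders with the listing and the reviews; the Favorites tool is absent rather than replaced by an error.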
In the DevOps culture, we see increasing intermingling of experience and knowledge between software engineers and operations teams. Engineers who understand well how their software is operated, and the interplay between various services and back-ends, often understand failure modes and can adapt. Their software and hardware architecture may take advantage of patterns like redundant services, failover services, or retry attempts after failures.
Resilient user experiences require further intermingling with product and design teams. Product design is focused almost entirely on user experience when the system is assumed to be working properly. So, we need to have product designers commingling with engineers to better understand individual failure modes and to plan for them.
Do these interface practices vary for desktops/laptops versus mobile or tablets?
Mike Brittain: These principles apply to any user interface. But as we move into using more mobile devices and networks, we need to consider the relative fragility of the network that connects our software (e.g. a smartphone app) to servers on the Internet.
Our design process may be hampered by our prior experiences in which computers and web browsers connected to the Internet by physical cables suffered relatively low network failure rates. As such, our expectations may be that the network is seldom a failure point. We’re moving rapidly into a world where mobile software connects to back-end services over cellular data networks — not to mention that the handset may be moving at high speed by car or train. So, we need to design resilience into our UIs anywhere we depend on network availability for data.
How do you set up a front-end to fail gracefully?
Mike Brittain: Front-end could mean client-side, or it could refer to the forward-most server-side script in the request flow, which talks to other back-end services to generate the HTML for a web page. Both situations are valid for resilient design.
In designing resilient UIs, you expect failures in each and every back-end service. Examples might include connection failures, connection timeouts, response timeouts, or corrupted/incomplete data in a response. A resilient UI traps these failures at a low level and provides a usable response, rather than throwing a general exception that causes the entire page to fail.
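One way to trap these failures at a low level is a small wrapper around each back-end call. The sketch below assumes the back-end returns JSON text; `safe_call` and the three sample back-ends are hypothetical names used only for illustration.

```python
import json
from typing import Any, Callable


def safe_call(request: Callable[[], str], default: Any) -> Any:
    """Call a back-end and return its parsed JSON, or `default` on failure."""
    try:
        return json.loads(request())
    except (ConnectionError, TimeoutError):
        # Connection failure, connection timeout, or response timeout.
        return default
    except json.JSONDecodeError:
        # Corrupted or incomplete data in the response.
        return default


def timed_out() -> str:
    raise TimeoutError("no response from service")


def truncated() -> str:
    return '{"items": [1, 2'  # response cut off mid-body


def healthy() -> str:
    return '{"items": [1, 2]}'


results = [safe_call(backend, {"items": []})
           for backend in (timed_out, truncated, healthy)]
print(results)
```

The caller always receives a usable value, so a single failing service cannot raise a general exception that takes down the whole page.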
On the client side, this could mean detecting failures in Ajax responses and either allowing the user experience to continue unblocked or retrying after a given amount of time. This could happen during page render, or during a user interaction. Those familiar with Gmail may recognize that during periods of network congestion or back-end failures, the small status message that reads “sending” when you send an email sometimes changes to “still trying” or “offline.” This is preferable to a general “failed to send email” after a single attempt.
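The retry-with-status behavior can be sketched as follows. This is an illustration in Python, not Gmail's implementation: the function name, the retry count, the delay, and the final "sent" status are all assumptions, and the status strings are borrowed from the example above.

```python
import time
from typing import Callable, List


def send_with_retries(send: Callable[[], None],
                      set_status: Callable[[str], None],
                      attempts: int = 3,
                      delay_seconds: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try `send` up to `attempts` times, reporting progress via `set_status`."""
    set_status("sending")
    for attempt in range(1, attempts + 1):
        try:
            send()
            set_status("sent")
            return True
        except (ConnectionError, TimeoutError):
            if attempt < attempts:
                # Keep the user informed instead of failing outright.
                set_status("still trying")
                sleep(delay_seconds)
    set_status("offline")
    return False


# Hypothetical flaky network: the first two attempts fail.
statuses: List[str] = []
calls = {"count": 0}


def flaky_send() -> None:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network congestion")


delivered = send_with_retries(flaky_send, statuses.append,
                              sleep=lambda seconds: None)  # skip real waiting
print(delivered, statuses)
```

Passing `sleep` as a parameter keeps the waiting behavior swappable, which makes the retry logic easy to exercise without real delays.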
Some general patterns for resilient UI include:
- Disable or hide features that are failing.
- Provide fallback (default) content in place of dynamic content or a feature that cannot be reached or displayed.
- Avoid behaviors that block UI display or interaction.
- Detect service failures and allow for retries.
- Fail over to redundant services.
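Two of the patterns above, failover and fallback content, combine naturally. The following Python sketch assumes each redundant service is exposed as a callable; `first_available` and the sample replicas are hypothetical names.

```python
from typing import Callable, Sequence


def first_available(replicas: Sequence[Callable[[], str]],
                    fallback: str) -> str:
    """Return the first successful replica response, else default content."""
    for fetch in replicas:
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            continue  # fail over to the next redundant service
    # Every replica failed: show default content rather than an error.
    return fallback


def primary() -> str:
    raise ConnectionError("primary recommendations service is down")


def secondary() -> str:
    return "Recommended for you: ceramic mug"


served = first_available([primary, secondary], "Popular right now")
degraded = first_available([primary, primary], "Popular right now")
print(served)
print(degraded)
```

Trying the replicas in order keeps the common path cheap, and the fallback guarantees the module always has something to display.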
Systems engineers may recognize these patterns in low-level services or protocols. But these patterns are not as familiar to front-end engineers, product developers, and designers — who plan more around success than around failure. I don’t mean for that statement to be divisive, but I do think it’s true of the current state of how we build software and how we build the web.
How do you make your community aware of a failure?
Mike Brittain: In the case of small failures, the idea is to obscure the failure in a way that it does not block the primary use case for the site (e.g. we don’t shut down product pages because the Favorites service is failing). Your community may not need much communication around this.
When things really go wrong, you want to be upfront and clear about failures. Use specific terms, rather than general. Provide context of time and estimated time to resolution whenever possible. If you have a service that fails and will be unavailable until you restore data over a period of, say, three hours, it’s better to tell your visitors to check back in three hours than to have them hammering the refresh button on their browser for 20 minutes as they build up frustration.
You want to make sure this information is within reach for your users. I actually think at Etsy we have some pretty good patterns for this. We start with a status blog that is hosted outside of our primary network and should be available even if our data center is unreachable. Most service warnings or error messages on Etsy.com will provide a link to this blog. And anytime we have a service outage posted to this blog, a service warning is automatically posted at the top of any pages within our community forums and anywhere else that members would go looking for help on our site.
In your Velocity 2012 keynote summary, you mention “validating failure scenarios with ‘game days’.” What’s a game day and how does it work?
Mike Brittain: The term “game day” describes an exercise that tests some failure scenario in production. These drills are used to test hypotheses about how our systems will react to specific failures. They also surface any surprises about how the system reacts while we are actively observing.
We do this in production because development, testing, and staging environments are seldom 100% symmetric with production. You may have different numbers of machines, different volumes of data, or simulated versus live traffic. The downside is that these drills will impact real visitors. The upside is that you build real confidence within your team and exercise your abilities to cope with real failures.
We regularly test configuration flags across our site to ensure that we haven’t unwired configuration logic for features we have been patching or improving. We also want to confirm that the user experience degrades gracefully when the flag is turned off. For example, when we disable the Favorites service on our site, we expect reads and writes to the data store to stop and we would expect various parts of the UI to hide the Favorites tools. Our game day would allow us to prove these out.
We would be surprised to find that disabling Favorites causes entire pages on the site to fail, rather than to degrade gracefully. We would be surprised if some processes continued to read from or write to the service while the config flag was disabled. And we would be further surprised to find unrelated services failing outright when the Favorites service was disabled. These are scenarios that might not be observed by simulated testing outside of production.
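The expectations for the Favorites flag can be written down as checks. The sketch below uses an in-memory stand-in for the data store and a plain dictionary for configuration flags; every class and name is hypothetical, and a real game day would observe production systems rather than a toy like this.

```python
from typing import Dict, Set, Tuple


class FavoritesStore:
    """In-memory stand-in for the data store that counts every access."""

    def __init__(self) -> None:
        self.reads = 0
        self.writes = 0
        self._data: Dict[str, Set[str]] = {}

    def get(self, user: str) -> Set[str]:
        self.reads += 1
        return self._data.get(user, set())

    def add(self, user: str, item: str) -> None:
        self.writes += 1
        self._data.setdefault(user, set()).add(item)


class FavoritesFeature:
    def __init__(self, store: FavoritesStore, flags: Dict[str, bool]) -> None:
        self.store = store
        self.flags = flags

    def enabled(self) -> bool:
        return self.flags.get("favorites", False)

    def render_button(self, user: str, item: str) -> str:
        if not self.enabled():
            return ""  # hide the tool; no read reaches the store
        if item in self.store.get(user):
            return "<button>Unfavorite</button>"
        return "<button>Favorite</button>"

    def add_favorite(self, user: str, item: str) -> bool:
        if not self.enabled():
            return False  # no write reaches the store
        self.store.add(user, item)
        return True


flags = {"favorites": True}
store = FavoritesStore()
feature = FavoritesFeature(store, flags)
feature.add_favorite("ana", "scarf")
html_on = feature.render_button("ana", "scarf")

flags["favorites"] = False  # the drill: turn the flag off
before: Tuple[int, int] = (store.reads, store.writes)
html_off = feature.render_button("ana", "scarf")
accepted = feature.add_favorite("ana", "hat")
after: Tuple[int, int] = (store.reads, store.writes)
print(html_off == "", accepted, before == after)
```

With the flag off, the button disappears and the read and write counters stop moving, which are the two outcomes the drill is meant to confirm.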
This interview was edited and condensed.
Associated photo on home and category pages: 404 error message something went wrong by IvanWalsh.com, on Flickr