Complexity fails: A lesson from storage simplification

Storage architectures show simplicity's power and how to build clouds at scale.

The default approach to most complex problems is to engineer a complex solution. We see this in IT generally, and in cloud computing specifically. Experience has taught us, however, that large-scale systems belie this tendency: Simpler solutions are best for solving complex problems. When developers write code, they talk about “elegant code,” meaning a concise, simple solution to a complex coding problem.

In this article, I hope to provide further clarity around what I mean by simple solutions and how they differ from more complex ones.

Simple vs. complex

Understanding simplicity seems … well, simple. Simple systems have fewer parts. Think of a nail. It’s usually a single piece of steel with one end flattened, and it does just one thing, so not much can go wrong. A hammer is a slightly more complex, yet still simple, tool. It might be made of a couple of parts, but it really has one function, and not much can go wrong. In comparison, a power drill is a significantly more complex tool, and computers are far, far more complex. As the number of parts grows and as a system takes on more functions, its complexity increases.

Related to this phenomenon is the simplicity or complexity of the parts themselves. Simple parts can be assembled into more complex systems that remain reliable precisely because each part is simple. Building complex systems out of complex parts, by contrast, leads us toward fragile, brittle Rube Goldberg contraptions.

Complexity kills scalable systems. If you want higher uptime and reliability in your IT system, you want smaller, simpler systems that fail less often because they have fewer, simpler parts.

An example: storage

A system that is twice as complex as another system isn’t just twice as likely to fail; it’s four times as likely to fail, because potential interactions grow with the square of the number of parts (more on that below). To illustrate this and drive the point home, I’m going to compare direct-attached storage (DAS) and storage-area network (SAN) technologies. DAS is a very simple approach to IT storage. It has fewer features than SAN, but it can also fail in fewer ways.

In the cloud computing space, some feel that one of Amazon Web Services’ (AWS) weaknesses is that it provides only DAS by default. To counter this, many competitors run SAN-based cloud services only, taking on the complexity of SAN-based storage as a bid for differentiation. Yet AWS remains the leader in every regard in cloud computing, mainly because it sticks to a principle of simplicity.

If we compare DAS and SAN and trace the path data takes when written by an application running “in the cloud,” it looks something like this figure:

Figure: DAS vs. SAN data path.

(A quick aside: Please note that I have left out all kinds of components, such as disk drive firmware, RAID controller firmware, complexities in networking/switching, and the like. All of these would count here as components.)

A piece of data written by the application running inside of the guest operating system (OS) of a virtual server flows as follows with DAS in place:

1. The guest OS filesystem (FS) software accepts a request to write a block of data.

2. The guest OS writes it to its “disk,” which is actually a virtual disk drive using the “block stack” (BS) [1] in its kernel.

3. The guest OS has a para-virtualization (PV) disk driver [2] that knows how to write the block of data directly to the virtual disk drive, which in this case is provided by the hypervisor.

4. The block is passed by the PV disk driver not to an actual disk drive but to the hypervisor’s (HV) virtual block driver (VBD), which is emulating a disk drive for the guest OS.

  • At this point we have passed from the “virtualization” layer into the “physical” or real world.

5. Repeating the process for the hypervisor OS, we now write the block to the filesystem (FS) or possibly volume manager (VM), depending on how your virtualization was configured.

6. Again, the block stack handles requests to write the block.

7. A disk driver (DD) writes the data to an actual disk.

8. The block is passed to a RAID controller to be written.

Obviously, laid out like this, the entire task already seems somewhat complex, but this is modern computing with many layers and abstractions. Even with DAS, if something went wrong, there are many places we might need to look to troubleshoot the issue — roughly eight, though the steps I listed are an approximation for the purpose of this article.

SAN increases the complexity significantly. Starting at step 7, the data takes a completely different route:

9. Instead of writing to the disk driver (DD), in a SAN system we write to a network-based block device: the “iSCSI stack,” [3] which carries SCSI [4] commands over TCP/IP.

10. The iSCSI stack then sends the block to the hypervisor’s TCP/IP stack to send it over the network.

11. The block is now sent “across the wire,” itself a somewhat complicated process that might involve things like packet fragmentation and MTU issues, TCP window size issues/adjustments, and more — all thankfully left out here in order to keep things simple.

12. Now, at the SAN storage system, the TCP/IP stack receives the block of data.

13. The block is handed off to the iSCSI stack for processing.

14. The SAN filesystem and/or volume manager determines where to write the data.

15. The block is passed to the block stack.

16. The block is passed to the disk driver to write out.

17. The hardware RAID writes the actual data to disk.

In all, adding the SAN conservatively doubles the number of steps and moving parts involved. Each step or piece of software may be a cause of failure. Problems with tuning or performance may arise because of interactions between any two components, and problems with one component may spark issues with another. To complicate matters, troubleshooting may be difficult because the guest OS might be Windows, the hypervisor could be Linux, and the SAN might be some other OS altogether, all with different filesystems, block stacks, iSCSI software, and TCP/IP stacks.
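To make the comparison concrete, here is a minimal sketch in Python that simply lists the layers named in the numbered steps above and compares the two write paths. The strings are shorthand for this article’s steps, not real driver or module names.

```python
# Steps 1-8: the DAS write path (firmware, switching, etc. still omitted).
DAS_PATH = [
    "guest filesystem (FS)",            # step 1
    "guest block stack (BS)",           # step 2
    "guest PV disk driver",             # step 3
    "hypervisor virtual block device",  # step 4
    "hypervisor FS / volume manager",   # step 5
    "hypervisor block stack",           # step 6
    "hypervisor disk driver (DD)",      # step 7
    "hypervisor hardware RAID",         # step 8
]

# Steps 9-17: with a SAN, everything from the disk driver onward is replaced
# by a network hop and a second, remote storage stack.
SAN_PATH = DAS_PATH[:6] + [
    "hypervisor iSCSI stack",           # step 9
    "hypervisor TCP/IP stack",          # step 10
    "the wire (network)",               # step 11
    "SAN TCP/IP stack",                 # step 12
    "SAN iSCSI stack",                  # step 13
    "SAN FS / volume manager",          # step 14
    "SAN block stack",                  # step 15
    "SAN disk driver",                  # step 16
    "SAN hardware RAID",                # step 17
]

print(len(DAS_PATH), len(SAN_PATH))     # 8 vs. 15 layers: roughly double
```

The matrix in the next section works with 16 components on the SAN side because it also keeps the hypervisor RAID in the grid before discounting its cells.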

Complexity isn’t linear

The problem, however, is not so much that there are more pieces, but that those pieces all potentially interact with each other and can cause issues. The effect is multiplicative. There are twice as many parts (or more) in a SAN, but that creates four times as many potential interactions, each of which could be a point of failure. The following figure shows all the steps as both rows and columns. Interactions could theoretically occur between any two of the steps. (I’ve blacked out the squares where a component intersects with itself because that’s not an interaction between different components.)

Figure: Complexity matrix.

This diagram is a matrix of the combinations, where we assume that the hypervisor RAID in my example above isn’t part of the SAN solution. The lighter-colored quadrant in the upper left shows the potential interactions or failure points for DAS, and the entire diagram shows those for SAN. Put in math terms, there are N * (N-1) possible interactions/failures. For this DAS example, that means 8 * (8-1), or 56. For the SAN, it’s 240 (16 * (16-1)) minus the 16 cells for the hypervisor RAID, for 224: exactly four times as many potential interactions that may cause problems or failures.
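The interaction arithmetic is easy to check; a small sketch, using the counts above (eight components for DAS, sixteen matrix rows for the SAN with the hypervisor RAID cells then removed):

```python
def interactions(n):
    # Every ordered pair of distinct components is a potential interaction:
    # one cell of the matrix off the blacked-out diagonal, so n * (n - 1).
    return n * (n - 1)

das = interactions(8)          # 8 * 7 = 56
san = interactions(16) - 16    # 16 * 15 = 240, minus the 16 cells for the
                               # hypervisor RAID, which isn't part of the SAN
print(das, san, san // das)    # 56 224 4 -- exactly four times as many
```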

How things fail and how uptime calculations work

To be certain, each of these components and interactions has a different chance of failure. Some are less likely to fail than others. The problem is that calculating your likely uptime works just like the matrix: the effect is multiplicative, not additive. If you want to predict the expected uptime of a system with two components that each offer 99% uptime, it’s 99% * 99%, or roughly 98% uptime.

If every component of our DAS or SAN system is rated for “five 9s” (99.999%) uptime, our calculation is as follows:

Figure: Simple vs. complex uptime numbers.
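The numbers behind this figure are straightforward to reproduce; a minimal sketch, assuming each of the eight DAS (or sixteen SAN) components independently delivers five 9s:

```python
per_component = 0.99999                      # "five 9s" for every component

das_uptime = per_component ** 8              # ~0.99992, roughly four 9s
san_uptime = per_component ** 16             # ~0.99984, roughly three 9s

minutes_per_year = 365 * 24 * 60             # 525,600 minutes in a year
print((1 - das_uptime) * minutes_per_year)   # ~42 minutes of expected downtime per year
print((1 - san_uptime) * minutes_per_year)   # ~84 minutes of expected downtime per year
```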

The point here is not that DAS is “four 9s” or SAN is “three 9s,” but that by adding more components, we have actually reduced our likely uptime. Simpler solutions are more robust because there are fewer pieces to fail. We have lost a full “9” by doubling the number of components in the system.

An anecdote may bring this home. Very recently, we were visited by a potential customer who described a storage issue. They had moved from a DAS solution to a SAN solution from a major enterprise vendor. Two months after this transition, they had a catastrophic failure. The SAN failed hard for three days, bringing down their entire cloud. A “five nines” enterprise solution actually provided “two nines” that year. In fact, if you tracked uptime across year boundaries (most don’t), this system would have to run without a single failure for 10-plus years to come even close to what most consider a high uptime rating.

It’s worth noting here that another advantage of a DAS system is smaller “failure domains.” In other words, a DAS system failure affects only the local server, not a huge swath of servers as happened with my anecdote above. This is a topic I plan to cover in detail in future articles, as it’s an area that I think is also not well understood.

Large-scale cloud operators are the new leaders in uptime

Once upon a time, we all looked to the telephone system as an example of a “high uptime” system. You picked up the phone, got a dial tone, and completed your call. Nowadays, though, as networks have grown more complex and moved to wireless, carrier uptimes have gone down. As data volumes have increased, carrier uptimes have suffered even further.

The new leaders in uptime and availability are the world’s biggest Internet and cloud operators: Google, Facebook, Amazon, eBay, and even Microsoft. Google regularly sees four nines of uptime globally while running one of the world’s largest networks, the largest plant of compute capability ever, and some of the largest datacenters in the world. This is not a fluke. It’s because Google and other large clouds run simple systems. These companies have reduced complexity in every dimension in order to scale up. This higher availability also comes while moving away from or eliminating complex enterprise solutions, running datacenters at extreme temperatures, and otherwise playing by new rules.

Wrapping up

What have we learned here? Simple scales. What this means for businesses is that IT systems stay up longer, have better resiliency in the face of failures, cost less to maintain and run, and are just plain better when simplified. Buying a bigger, more complex solution from an enterprise vendor is quite often a way to reduce your uptime. This isn’t to say that this approach is always wrong. There are times when complex solutions make sense, but stacking them together creates a multiplicative effect. Systems that are simple throughout work best. Lots of simple solutions with some complexity can also work. Lots of complex solutions means lots of failures, lots of operational overhead, and low uptime.

Keep it simple because simple scales while complexity fails.


[1] A “block stack” is a term I use in this article to represent the piece of kernel software that manages block devices. It’s similar in concept to the “TCP/IP stack” used to manage the network for a modern operating system. There is no standard name for the “block stack” as there is for the networking software stack.

[2] Para-virtualization drivers are required in hardware virtualization systems to achieve higher-performance disk and network I/O. They look like a standard operating system disk driver, but they understand how to talk to the hypervisor in a way that increases performance.

[3] In this case, I use iSCSI as the SAN, but AoE, FC, or FCoE are all similar in nature.

[4] The observant reader will note that at the guest layer, we might be writing to a virtual SATA disk, then using SCSI commands over TCP/IP, and finally writing via SCSI or SATA. Each of these protocols potentially has its own issues and tuning requirements. For simplicity, I’ve left the details of the challenges there out of these examples.

  • http://www.nexenta.com Evan Powell

    Lots of respect for Randy and for the fallacy of the 9s meme. However, as Randy knows, if you have many systems working in parallel that all have, for example, 5 9s, then the reliability actually increases, not decreases.

    This is how Amazon and others are more or less able to get to 4 9s. Their overall service reliability is HIGHER than the reliability of each component. And this is why RAID sets are more reliable than single disks.

    Storage in the cloud is moving in this direction too. While we at Nexenta would not argue for a legacy SAN per Randy’s example, we would vote for the radical improvement in the simplicity and redundancy of a scale-out NFS system that never, ever, no matter what, loses the data. Silent data corruption is NOT caught by today’s clouds in general, and we are hearing about 99.99% uptime clouds returning junk.

    At our foundation level of the stack, we believe we must have 100%, no-exception, never-ever-lose-the-data reliability. Losing the data and then trying to find it somewhere else in the cloud if you notice you’ve corrupted or lost it (after all, we stored 3 copies – hello, power bill!) is not the wave of the future. It is a way to guarantee replication storms. This will particularly be the case as the cloud is used more and more for mainstream use cases as opposed to write-it-and-forget-it backups and photos.

    As we speak, we at Nexenta are moving a mega cloud from a naive DAS-like architecture to one with some basic shared storage, because DAS didn’t scale. For most clouds, you had better have a cloud-friendly foundation for your data; otherwise your operators, your customers, your power bills, and your CapEx to store all the data are going to impact you.

  • http://www.cloudscaling.com Randy Bias

    Hi Evan,

    Thanks for the comments. I think you are missing a key idea. RAID is a piece of software that creates a higher level of redundancy above less reliable disk drives. This is the “design-for-failure” model that Amazon and Google take across their entire datacenter stack. Sometimes I call it the “no nines” model. Typical enterprise systems attempt to mitigate all risk and provide an “uptime target” such as five 9s. The evolving mentality in cloud is about risk acceptance and management, rather than risk mitigation. HA “pair” systems inevitably fail. RAID and load balancing are examples of a *different* mentality: scale-out, multi-node systems with software running on top that assumes the underlying units will all fail at some point. The software, rather than attempting to keep components from failing, accepts their failure. This model is not only simple and proven, it’s significantly more robust than running HA pairs or “clusters”, which can be technologically complex. It’s important to recognize that adding 5 9s components together does NOT increase the uptime target numbers. It’s about a different model altogether.

    Of course, Nexenta knows this, as you worked with us on the KT cloud architecture, which was more of a “scale-out SAN” model, the kind you allude to here. This is also the kind of model that Amazon Web Services’ Elastic Block Storage uses.

    You are arguing about SAN vs. object storage in your example above, which is not my point. SAN solves a different problem than object storage systems do. Object storage systems are more akin to tape backup. There are good reasons to use either of those solutions in different ways. SAN vs. DAS is a question for the real-time storage of virtual machine instances that are running on a cloud. As you know, Amazon offers both approaches and lets the cloud end-user make that choice. As you also know, the only catastrophic failure to date at Amazon was on their SAN-like system, EBS, a kind of failure that is impossible with a DAS solution.

    Finally, this article is about how to build more scalable systems. Systems scale when the components are simple to begin with. Period. Building a massive single HA pair of SAN boxes will not scale to datacenter size. Only a more scale-out approach will do this. The scale-out SAN model used at KT or Amazon is valid, as is the scale-out DAS approach, with the latter being even simpler and more scalable than the former. That being said, as we can see with EBS, the scale-out SAN model is necessary for certain workloads and use cases.

    I don’t think you can paint all applications and workloads with the same brush and the same data protection, availability, or scaling requirements.

    –Randy

  • Mark Mitchell

    Excellent explanation for an important subject.

    Have you noticed a trend in terms of simpler systems being faster to diagnose and fix after they have failed, relative to more complicated systems?

    In other words, if there were a ton of data to pore through to debug the interactions in a DAS system, would there be four tons of data in a SAN system?

    Thanks
    Mark

  • http://www.cloudscaling.com Randy Bias

    Mark,

    Yes, that is an attribute of simpler systems and simpler parts. They are easier to get visibility into and fix relative to more complicated systems.

    I don’t have a quantification like “4x” that I could apply, however.

    Best,

    –Randy

  • http://www.richardelling.com Richard Elling

    Hi Randy,
    Unfortunately, an availability analysis is not appropriate here. 5-9s means nothing because, beyond simple systems such as RAID arrays, the probability of every component being up at the same time approaches zero. At the extreme, consider the Internet. In modern history, at no time have all of the devices connectable to the Internet been up at the same time. Similarly, at no time have all of the devices been down. For systems at scale, the simplistic uptime/downtime ratio does not apply.
    A better approach to describing the goodness of a design or the health of the system is to measure performability. A fully functional system with many moving parts has some level of capacity that is the sum of the parts. As parts fail, the capacity of the system is reduced, but the system continues to deliver some work. In some cases, people measure this as impacted user minutes or impacted sites.
    When we design systems to operate while degraded (e.g., the Internet), the trade-offs include choices that are far more complex than availability, as you clearly demonstrated. In most cases, there is a lot of grey area between millions of itty-bitty disks and a highly reliable storage array. In either case, reliability trumps availability, and the choice of components with higher reliability is rewarded by less impact from system degradation. Large systems are always degraded, so there can be a real economic incentive to achieving the right mix of capital costs vs. reliability.
    Evan also offers a valid point that is missed in your analysis. The failure of PCI-based storage (HBAs or SSDs) almost always causes loss of data. These devices are often targeted at the consumer space, where data loss is an accepted fact of life. This does not have to be the case, but the economics of the market drive the behaviour. In cloud-scale storage systems, data loss is prevented by replication and distributed data stores. This is a good thing, because it is always better to build redundancy closer to the application. The cost cannot go unnoticed, however: both the capital cost of triple redundancy (e.g., OpenStack Swift) and the ongoing maintenance cost of having 3x the number of storage components operating in degraded mode should be considered. As the cost-per-bit and cost-per-latency go down, the solution to the design equations can change.
    I think this is the real message here: there is no “one true solution,” but there are a number of solutions that have been demonstrated to work well. If we can remove the blinders, we can often discover a more optimal solution.