Retreading old topics can be a powerful source of epiphany, sometimes more so than simple extra-box thinking. I was a computer science student, of course I knew statistics. But my recent years as a NoSQL (or better stated: distributed systems) junkie have irreparably colored my worldview, filtering every metaphor with a tinge of information management.
Lounging on a half-world plane ride has its benefits, namely the opportunity to read. Most of my Delta flight from Tel Aviv back home to Portland lacked both wifi and (in my case) a workable laptop power source. So instead, I devoured Nate Silver’s book, The Signal and the Noise. When Nate reintroduced me to the concept of statistical overfit, and its relative underfit, I could not help but consider these cases in light of the modern problem of distributed data management, namely its operators (you may call these operators DBAs, but please, not to their faces).
When collecting information, be it for a psychological profile of chimp mating rituals, or plotting datapoints in search of the Higgs Boson, the ultimate goal is to find some sort of usable signal, some trend in the data. Not every point is useful, and in fact, any individual could be downright abnormal. This is why we need several points to spot a trend. The world rarely gives us anything clearer than a jumble of anecdotes. But plotted together, occasionally a pattern emerges. This pattern, if repeatable and useful for prediction, becomes a working theory. This is science, and is generally considered a good method for making decisions.
On the other hand, when lacking experience, we tend to overvalue the experience of others when we assume they have more. This works in straightforward cases, like learning to cook a burger (watch someone make one, copy their process). It becomes less useful as the similarities diverge. Watching someone make a cake won’t tell you much about the process of crafting a burger. Folks like to call this cargo cult behavior.
How Fit are You, Bro?
You need to extract useful information from experience (for which I’ll use the math-y sounding word datapoints). Having a collection of datapoints to choose from is useful, but that’s only one part of the process of decision-making. I’m not speaking of a necessarily formal process here; in the case of database operators, it’s merely a collection of experience. Reality tends to be fairly biased toward facts (despite the desire of many people for this not to be the case). Given enough experience, especially if that experience is factual, we tend to make better and better decisions, more in line with reality. That’s pretty much the essence of prediction. Our mushy human brains are more-or-less good at that; at least, better than other animals. It’s why we have computers and Everybody Loves Raymond, and my cat pees in a box.
Imagine you have enough relevant datapoints to plot on a chart. Assuming the axes bear some relation to each other, and the data is sound, a trend may emerge, such as a line or some other bounding shape. A signal is relevant data that corresponds to the rules we discover by best fit. Noise is everything else. It’s somewhat circular-sounding logic, and it’s genuinely hard to know what is really a signal. This is why science is hard, and so is choosing a proper database. We’re always checking our assumptions, and one solid counter-signal can be disastrous for a model. We may have been wrong all along, lacking only enough data to see it. As Einstein famously said in response to the book 100 Authors Against Einstein: “If I were wrong, then one would have been enough!”
Database operators (and programmers forced to play this role) must make predictions all the time, against a seemingly endless series of questions. How much data can I handle? What kind of latency can I expect? How many servers will I need, and how much work to manage them?
So, like all decision-making processes, we refer to experience. The problem is, as our industry demands increasing scale, very few people actually have much experience managing giant-scale systems. We tend to draw our assumptions from our limited or biased small-scale experience and extrapolate outward. The theories we then concoct are not the optimal fit we desire, but instead tend to be overfit.
Overfit is when we have a limited amount of data and overstate its general implications. If we imagine a plot of likely failure scenarios against a limited number of servers, we may be tempted to believe our biggest risks of failure are insufficient RAM or disk failure. After all, my network has never given me problems, but I sure have lost a hard drive or two. We take these assumptions, which are only somewhat relevant to the realities of scalable systems, and divine some rules for ourselves that entirely miss the point.
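To make overfit concrete, here is a minimal sketch (pure Python, with noise values I made up for illustration): a model flexible enough to pass through every observed point looks perfect on the data we have, yet falls apart the moment we ask it about data it hasn’t seen, while a boring straight line generalizes just fine.

```python
def line_fit(xs, ys):
    """Ordinary least-squares line: the 'simple theory'."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda x: slope * x + intercept

def interpolate(xs, ys):
    """Lagrange polynomial through every point: the 'overfit theory'."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

# Noisy samples of an underlying trend y = 2x.
xs = [0, 1, 2, 3, 4]
ys = [0.3, 1.8, 4.4, 5.9, 8.2]

simple = line_fit(xs, ys)
overfit = interpolate(xs, ys)

# The overfit model reproduces every training point exactly,
# but at an unseen x = 6 (true trend value: 12) its prediction
# swings wildly, while the simple line stays close.
```

The overfit polynomial has zero error on the five points it was built from, which is exactly why it feels so convincing; its error on the sixth point dwarfs the line’s.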
In a real distributed system, network issues tend to consume most of our interest. Single-server consistency is a solved problem, and most (worthwhile) distributed databases have some sense of built in redundancy (usually replication, the root of all distributed evil).
Another common assumption is that server clocks can be used for causality: the idea that you can synchronize multiple servers, then use that “universal clock” to timestamp data. This faulty theory is a simple extrapolation of running code on one server. If a timestamp on server A is older than a timestamp on server B, the hypothesis goes, the event on A must have happened first. But anyone who has managed thousands of systems knows that clocks can, and do, get out of sync. Sure, servers can query a service to stay approximately together, but this is little more than a technical version of the classic spy movie “synchronize watches” routine, where everyone puts their wristwatches mid-screen and sets midnight at once. It’s imprecise at best. At worst, services fail. Packets are lost. Clocks drift. This is why logical clocks (e.g., Lamport timestamps or vector clocks) exist. Believe me, we certainly don’t make these things for fun.
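The core of a Lamport clock fits in a few lines. Here is a minimal sketch (a real implementation also pairs the counter with a node id to break ties between concurrent events, which I’ve omitted):

```python
class LamportClock:
    """Minimal Lamport logical clock: a counter ordered by causality,
    not by wall time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event: just advance the counter.
        self.time += 1
        return self.time

    def send(self):
        # Stamp an outgoing message with the incremented counter.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Merge: jump past both our own clock and the message's stamp,
        # so the receive is always "after" the send.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
stamp = a.send()           # A sends a message stamped 1
b_time = b.receive(stamp)  # B's clock jumps to 2, regardless of B's wall clock
```

Note what it guarantees: if one event causally precedes another (a send before its receive), its timestamp is smaller. No NTP, no drift, no synchronized watches.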
There are dozens of these assumptions concerning large-scale distributed databases, but let me cover one more: the idea that network partitions are so rare they can be safely ignored. Strangely, some distributed database implementations are themselves designed this way, and actively promote this theory. A network partition is simply the possibility that a message between two computers can be lost. There are so many ways this can happen that it’s almost laughable to try to enumerate them. I’m sure you’ve thought of three already. Causes range from the physical (tripping on a wire) to a crashed server, an overloaded system (ignoring new requests), bad configuration (a wrong IP address), and so on. Even worse, you never know why or when a packet was lost. Was the message from A lost on its way to B? Or was B’s response back to A lost? Is B still running, but only accepting requests from other servers (I like to call this the “unfaithful spouse”)? From A’s point of view, these are all the same event: silence.
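That ambiguity is easy to sketch. In the toy code below (every name is illustrative, not from any real library), three very different failure modes, including one where B actually did the work, collapse into the exact same observation on A’s side:

```python
class Timeout(Exception):
    """Stand-in for a network timeout."""

def request_lost(payload):
    # The packet never reached B; nothing happened.
    raise Timeout()

def response_lost(payload):
    # B processed the request, but the reply vanished on the way back.
    _result = "processed: " + payload  # work happened; A will never know
    raise Timeout()

def server_down(payload):
    # B isn't running at all.
    raise Timeout()

def observe(request_fn):
    """What A actually sees: success, or undifferentiated silence."""
    try:
        return ("ok", request_fn("write x=1"))
    except Timeout:
        return ("silence", None)

# All three failure modes yield the single observation ("silence", None).
observations = {observe(f) for f in (request_lost, response_lost, server_down)}
```

The nasty case is `response_lost`: the write may have been applied, so blindly retrying risks applying it twice, and giving up risks losing it. That is the dilemma partitions force on every distributed database.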
This is another case of an insufficient worldview being extrapolated onto the large scale. It’s true that if servers are sitting physically together in the same cabinet there are fewer chances of partitions happening (they still can… have you ever written a bad iptables entry?). But this probably can’t be guaranteed in a virtualized environment, and is made worse when dealing with multiple datacenters.
If you want to see what happens when partitions hit real systems, I highly recommend checking out the Jepsen series by Kyle Kingsbury. Simple packet loss can wreak havoc on all manner of distributed databases, even those that like to make strong (or any, for that matter) consistency claims. My simple advice is to ensure your collection of information is based on science, not marketing (and certainly not yet another one-day workshop, which lies somewhere between marketing and brainwashing).
Cargo Cult Operations
If a misfit theory is one source of database operators’ bad decisions, and we can’t easily know what we don’t know (thanks Rumsfeld), how can we ever choose? This leads me to the next most common method of choosing a database: cargo culting. It’s a common story, so I’ll be fast for those unaware.
During World War II, remote Melanesian islands were often used as air force and navy bases. These islands were inhabited, however, by indigenous people who had never come into contact with such technologically advanced civilizations. From their point of view, airplanes dropped cargo from the heavens, bringing food, clothing, medicine, and other modern accouterments. As the war wound down, those bases were abandoned and the airdrops stopped. But the islanders, craving the sweet bounty of the offlanders, decided to fix their problem by mimicking their behaviors. They built mock air control towers, wore the soldiers’ uniforms, and copied the movements they saw. They marched while carrying carved guns, lit signal fires, and waved sticks to guide planes onto the runways.
Labeling some action as cargo cult behavior is a metaphor for (imprecisely) copying the behavior of someone else without understanding the underlying purpose, while expecting the same result. In the database world this happens everywhere from copying operations strategies, to query techniques, to outright cloning a database design. Some of my favorite common cases I’ve seen are (in classic straw man conversational form): “Google uses BigTable, I want to be Google, I’ll use HBase (OSS based on BigTable).” “Facebook created Cassandra, Facebook has smart engineers, ergo, Cassandra must be smart.” “Huge companies query by map/reduce so that’s the only way to query big data.” And so on. Just like overfitting your data, cargo culting is based on an incomplete understanding of the tradeoffs required to solve your problem.
Google is huge; unbelievably, massively, gigantically huge. They have resources most of us don’t, as well as (I hate to put it this way) esoteric knowledge most of us lack. Simply watching the small amount of information they let out will never give you a full picture of their overall solution. While everyone was rushing to implement BigTable and MapReduce, Google was secretly working on MegaStore to deal with the weaknesses of that model. By the time they published information about MegaStore, they were already implementing Spanner. Now that the world knows about Spanner, you can be damned sure they are already wrestling with its weaknesses.
Another problem with cargo culting is that large operations know exactly what to buy, because they have very predictable growth. Moreover, they can leverage economies of scale, meaning they can get resources more cheaply than you because they buy more. You can’t. So in an attempt to copy LinkedIn, you will likely end up misallocating resources. You may spend money on the beefiest Joyent VMs but fail to properly consider bandwidth costs. You may allocate a lot for provisioned IOPS on EC2 without considering that your Redis cluster doesn’t need that level of speed for occasional snapshotting.
This occurs on the smaller scale, too. That one company has used MongoDB successfully for their three-node cluster does not imply you will have the same success with your thirty nodes. They may value ease of development over operational complexity. You may not. On the flip side, you may have an entirely different data model, perhaps highly interconnected data best served by a graph or triple datastore rather than a key/value DB like Riak.
I know, I just spent pages pointing out land mines. That was the point. If you read most posts about NoSQL databases, you’ll get pages of over-confidence or under-confidence about details ranging from horizontal scalability and predictable growth to simplicity of development, high availability, and strong consistency. You’ll even hear insane statements like “I’ve disproven the CAP theorem” or nonsensical ones like “Mongo is webscale.” So how does one choose a strategy? By a combination of the Feynman Problem Solving Algorithm (1. Write down the problem; 2. Think very hard; 3. Write down the answer) and testing your ideas.
If you are very confident that your experience (datapoints, whether self-gathered or studied) is sufficient to make a decision, you will concoct your own theory to work from. Good thinkers will pay attention to relevant signals and ignore the noise, but that requires enough information to work from. This is further complicated because not all evidence is equally credible. Bad thinkers will overfit limited or irrelevant experience into a flawed theory, or worse, craft one without material information whatsoever. Be wary of overconfidence.
Those lacking in confidence will not trust their own experience, and instead opt to watch what others do. Good observers will study the actions of a similar organization with a similar problem set. Bad observers will simply follow the actions of the largest, or most vocal, organizations. Or worse, they’ll trust the marketing arm of a NoSQL company trying to sell something. I know, I know, I work for one of those companies–don’t trust me either. Trust in yourself.
So what’s the answer? Just like the disappointing answers of a billion moms before me (study hard, eat well, wear a helmet, don’t hit your sister), the solution is education, experience, and self-introspection. But most of all: test your theories. A few years ago that advice was a tough pill to swallow. Servers were expensive, information was sparse. But today, everything you need to verify your assumptions is there. A co-worker of mine tried this experiment two years ago: start up a 100-node Riak cluster for $2. If you’re a company, you can’t afford to wait until production to test your theories.
Sometimes you’re working off of faulty assumptions. Sometimes you’re following the wrong leader. And sometimes, you’re just embarking into uncharted territory. But all of these ideas can and should be experimented against. Try to think of a best fit, including the advice of those who know, then test it your own damn self.