Errors in Database Systems Still Must Consider Network Partitions
Prof. Michael Stonebraker wrote a paper, published in the April 2010 issue of CACM, entitled “Errors in Database Systems: Eventual Consistency and the CAP Theorem”. As I see it, the overall point of the paper is that the kinds of failures that cause partition-tolerance problems are rare, and not too significant compared to the other ways that a DBMS can fail. Therefore, people are worrying too much about the CAP issue, specifically network partitions.
The paper enumerates eight causes of DBMS failure as seen by an application, such as software errors in the DBMS itself, operating system errors, and so on. Number six in his list is “a network partition in a local cluster”, and number eight is “a network failure in the WAN connecting clusters together; the WAN failed and clusters can no longer all communicate with each other”. (The usual reason one would have multiple clusters connected by a WAN is for “disaster recovery”, i.e. dealing with a problem that causes an entire cluster to fail, such as power loss over all hardware on which the cluster depends.)
Number six is the crucial issue in the paper, as far as the CAP issue goes. He has two answers:
- In my experience, this is exceedingly rare, especially if one replicates the LAN (as Tandem did).
- The overwhelming majority [of local failures] cause a single node to fail, which is a degenerate case of a network partition that is easily survived by lots of algorithms.
About #1, I have spent some time talking to the operations architects at ITA Software, which has run high-availability servers for many years now, and heard about their experience. It depends on what you mean by a “LAN”. If you mean a few computers connected together by an Ethernet, with redundant hardware all around, then the chance of a failure of the network itself is relatively low. However, real-world data centers with a relatively large number of servers rarely work this way. The problem is that a real network is very complicated. It depends on switches at both level 3 (routers) and level 2 (hubs). Situations can arise in which pieces of the network are mis-configured by accident; these can be hard to find due, ironically, the very redundancy that was added to avoid failures. In particular, there is no way to make any kind of guarantee about the latency within the network, nor the likelihood that a packet will make it from its source to its destination.
About #2, it would have been helpful if there were citations in the paper. It is hard to reply to such a claim without specifics. One of the techniques to deal with one server being down involved “quorums”, but they can introduce problems with high-availability.
But, more importantly, consider some of the failure modes that Amazon’s “Dynamo” highly-available key-value store is built to deal with. Suppose we have a key-value pair that resides on two servers, for high availability. Call them A1 and A2. An application changes the value of the pair, but does so at a time with A1 is down or unreachable. So the update is made to replica A2. Later, a second application reads the value associated with the key, but this time server A1 is down or unreachable, and server A2 is available. The second application might not see the new value written by the first application. I don’t know any of “lots of algorithms” that deals with this sort of scenario while providing complete consistency/correctness.
The conclusion of the paper is: “In summary, one should not throw out the C so quickly, since there are real error scenarios where CAP does not apply and it seems like a bad tradeoff in many of the other situations.” But since network-partition failures really can happen, it’s not clear that one can simply decide not to throw out the consistency/correctness criterion.
December 14th, 2010 at 9:47 am
I agree with the lack of citations but generally what I think “lots of algorithms” refers to have something to do with various families of algorithms that fall into certain “categories”:
1. Gossip-based or Quorum-based validation techniques (similar to what the Dynamo developers do)
2. Invalidation protocols — for example, a node will know that it just recovered from a failure and update itself from potential updates that have occurred in the state of the storage system since the last time it was up; all queries to the node where data hasn’t been validated to be “up to date” will be referred to the fail-over nodes in the group.
3. Canonical replication — this works the DNS way where you have eventual consistency, and tolerate a certain level of “staleness” of data in the intermediary caches. Of course, you cannot guarantee global consistency in this type of system at any given time, but mostly if the data and the access patterns warrant it, you’d have a canonical source of the data, and a means of propagating changes out to intermediary nodes.
I’m not sure if there are a lot of publications out there about the types of algorithms I mention above, but I can imagine there would be a lot of these types of algorithms. I may be wrong though.
December 14th, 2010 at 10:12 am
The conclusion here is a bit of a straw-man argument. Nobody is advocating that people decide to throw out anything, or that you can ignore part of CAP.
Step 1: Figure out how much the network partitions that will cause your system to be unavailable will cost.
Step 2: Figure out how much giving up consistency will cost.
Step 3: Make your CAP decision base on these estimated values.
Stonebraker’s main point is that people often overestimate the first value and underestimate the second.
I’ll personally add that that doesn’t mean keeping consistency is always cheaper. Also, these values are basically impossible to accurately estimate. The best we can do is guess based on experience, and everyone has different “disaster-baggage” they bring to a new project.
December 14th, 2010 at 2:17 pm
Hey dad! I just found your blog
I’m going to have to ask you about this when I get home.
December 23rd, 2010 at 10:09 pm
@John: I stand by my statement that dismissing network partitions in a LAN as “exceedingly rare” is not reasonable, at least if your LAN goes beyond a very simple configuration. I much prefer your sort of statement. We should add to it another step, in which you determine how likely a partition is (and how long it is to last). I wish that the paper had been written to make its statement in this way. No matter how hard the numbers are to estimate, it’s important to know how to characterize the problem that way. As you say, with more experience, the numbers can get more credible. Thanks for your comments!
April 18th, 2011 at 8:05 am
Alex Feinberg, a comment to another blog item herein, referenced this excellent paper, which made many of the same points I’m making. Had I known about it at the time, I would have cited it. It’s very well-written! It’s from Henry Robinson, posted on the Cloudera blog:
http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/